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Abstract 

Throughout  history,  handwriting  has  been  the  primary  means  of  recording  information 
that  is  persevered  across  both  time  and  space.  With  the  coming  of  the  electronic  docu¬ 
ment  era,  we  are  challenged  with  making  an  enormous  amount  of  handwritten  documents 
available  for  electronic  access.  Though  many  handwritten  documents  contain  only  hand¬ 
writing,  now,  more  are  mixed  with  printed  text,  noise,  and  background  patterns.  The 
mixture  of  handwriting  with  other  components  presents  a  great  challenge  for  making  an 
original  document  electronically  accessible. 

Many  handwritten  documents  come  together  with  a  special  background  pattern,  rule 
lines,  which  are  printed  on  the  paper  to  guide  writing.  After  digitization,  rule  lines 
will  touch  text  and  cause  problems  for  further  document  image  analysis  if  they  are  not 
detected  and  removed.  In  this  dissertation,  we  present  a  rule  line  detection  algorithm 
based  on  hidden  Markov  model  (HMM)  decoding,  achieving  both  high  detection  accuracy 
and  a  low  false  alarm  rate.  After  detection,  line  removal  is  performed  by  line  width 
thresholding. 

Handwriting  often  mixes  with  printed  text,  such  as  signatures  and  annotations  on  a 
business  letter.  Handwriting  in  a  printed  document  often  indicates  corrections,  additions, 
or  other  supplemental  information  that  should  be  treated  differently  from  the  main  con¬ 
tent.  The  data  set  we  are  processing  is  noisy,  which  makes  the  problem  more  challenging. 
In  this  dissertation,  we  first  segment  the  document  at  a  suitable  level,  and  then  classify 
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each  segmented  block  as  machine  printed  text,  handwriting,  or  noise.  Markov  random 
held  (MRF)  based  post-processing  is  exploited  to  refine  the  classification  results. 

The  identified  handwriting  may  be  further  analyzed.  In  this  dissertation,  we  propose 
a  novel  point-pattern  based  handwriting  matching  technique  and  apply  it  for  handwriting 
synthesis  and  retrieval.  We  formulate  point  matching  as  an  optimization  problem  trying 
to  preserve  the  local  neighborhood  structures.  After  establishing  the  correspondence 
between  two  handwriting  samples,  we  warp  one  sample  toward  the  other  using  the  thin 
plate  spline  (TPS)  deformation  model  to  synthesize  new  handwriting  samples.  We  also 
apply  our  matching  algorithm  for  handwriting  retrieval  since  it  is  much  easier  to  define 
robust  features  based  on  the  matching  results. 

Keywords:  Handwritten  Document  Analysis,  Rule  Line  Detection,  Handwriting  Match¬ 
ing,  Shape  Matching,  Handwriting  Synthesis,  Handwriting  Retrieval 
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1  Introduction 


Handwriting  was  developed  as  a  means  to  expand  human  memory  and  to  facilitate  com¬ 
munication.  It  has  changed  tremendously  over  time  when  new  writing  tools  are  invented. 
The  latest  change  is,  that  with  the  acceptance  of  new  technologies  such  as  personal  dig¬ 
ital  assistants  (PDAs)  and  cellular  phones,  handwriting  can  be  collected  on-line  without 
losing  temporal  information,  which  opens  a  great  opportunity  for  the  analysis  of  hand¬ 
writing,  such  as  handwriting  recognition  and  signature  verification.  New  technologies 
also  challenge  the  persistence  of  handwriting.  For  example,  the  oldest  books  are  hand- 
copied.  However,  the  printing  press  and  typewriters  opened  up  the  world  of  formatted 
documents  and  made  scriptoria  obsolete.  Almost  all  books  in  the  past  several  centuries 
have  been  machine  printed.  Recently,  computer  and  communication  technologies  such  as 
word  processors,  fax  machines,  and  e-mails  provide  new  ways  to  expand  human  memory 
as  well  as  facilitate  communication.  In  this  perspective,  one  may  ask:  Will  handwriting 
be  threatened  with  extinction? 

All  these  inventions  have  led  to  the  fine-tuning  and  reinterpreting  of  handwriting. 
With  the  increase  of  literacy,  more  and  more  people  learn  to  read  and  write.  As  a 
general  rule,  as  the  length  of  the  handwritten  message  decreases,  the  number  of  people 
using  handwriting  increases  [1].  Widespread  acceptance  of  digital  computers  seemingly 
challenges  the  future  of  handwriting.  However  in  numerous  situations,  pen  and  paper 
provides  more  convenience  than  a  keyboard.  For  example,  most  students  still  do  not  type 
lecture  notes  on  a  notebook  computer.  They  record  language,  equations,  and  graphs  with 
a  pen.  Many  people  still  prefer  to  keep  a  hard  copy  of  documents,  even  when  electronic 
versions  are  available.  They  make  annotations  on  a  document  when  they  are  reading. 
Also,  handwriting  is  demanded  by  law  such  as  signatures  on  legal  documents.  This 
brings  a  new  challenge  to  process  such  documents  where  handwritten  annotations  mix 
with  machine  printed  contents.  Since  the  segmentation  and  recognition  techniques  for 
machine  printed  text  and  handwriting  are  very  different,  different  contents  should  be 
identified  before  further  processing. 

Documents  are  the  result  of  a  set  of  physical  processes  and  conditions,  and  the  result¬ 
ing  document  can  be  viewed  as  consisting  of  layers,  such  as  handwriting,  machine  printed 
text,  background  pattern,  figures,  tables,  and/or  noise.  Fig.  la  shows  how  several  layers 
combine  to  generate  a  document.  Not  all  layers  are  present  in  a  single  document.  The 
dashed  arrows  in  the  figure  mean  that  the  corresponding  components  are  missed  in  this 
example.  Document  analysis  reverses  these  processes  to  segment  a  document  into  layers 
with  different  physical  and  semantic  properties.  This  procedure  is  shown  in  Fig.  lb. 
Since  the  1960s,  much  research  on  document  processing  has  been  done  based  on  optical 
character  recognition  (OCR).  A  more  general  study  of  document  analysis,  such  as  page 
(or  zone)  segmentation,  zone  classification,  and  table  detection,  began  in  early  1980s. 
After  more  than  two  decades  of  research,  automatic  machine  printed  text  segmentation 
and  recognition  for  clean  documents  can  be  viewed  as  a  solved  problem  with  commercial 
products  on  the  market.  However,  much  work  needs  to  be  done  for  handwriting,  such 
as  separating  handwriting  from  machine  printed  text,  segmentation  and  recognition  of 
handwriting. 

The  study  of  handwriting  covers  a  broad  field,  dealing  with  numerous  aspects  of  this 
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Figure  1:  (a)  A  document  generation  model,  (b)  A  document  image  analysis  model 


complex  task.  It  involves  research  concepts  from  several  disciplines:  experimental  psy¬ 
chology  [2],  computer  science  [3],  education  [4],  and  forensic  document  examination  [5]. 
For  computer  processing  of  handwriting  there  are  several  types  of  analysis,  recognition, 
and  interpretation  associated  with  it.  Handwriting  recognition  transforms  the  spatial 
form  of  graphical  marks  into  symbolic  representation.  Signature  identification  deter¬ 
mines  the  author  of  a  sample  from  a  set  of  individuals.  Signature  verification  determines 
whether  the  signature  belongs  to  a  given  person. 

This  dissertation  presents  our  approach  to  identifying  the  handwriting  layer  in  a 
document  image  from  other  layers  such  as  background  patterns,  noise,  and  machine 
printed  text.  After  handwriting  identification,  we  propose  an  approach  to  handwriting 
matching  that  can  be  applied  for  handwriting  synthesis  and  retrieval. 

1.1  Rule  Line  Detection 

Many  handwritten  documents  come  together  with  a  special  background  pattern:  rule 
lines.  These  rule  lines  are  printed  on  the  paper  to  guide  writing.  After  digitization 
they  will,  however,  touch  text  and  cause  problems  for  further  document  analysis  such  as 
segmentation  and  recognition.  These  lines  must  be  detected  and  removed  before  the  text 
is  fed  to  an  optical  character  recognition  (OCR)  engine.  These  rule  lines  may  appear 
severely  broken  since  they  are  thin  and  printed  with  a  light  color.  Many  line  detection 
algorithms  have  been  proposed  in  the  literature  [6-11].  They  work  well  on  relatively 
clean  documents  with  solid  or  mildly  broken  lines,  but  performance  will  significantly 
deteriorates  if  lines  are  severely  broken  because  of  low  image  quality  or  if  they  mix, 
touch,  and  overlap  with  text.  It  is  difficult,  if  not  impossible,  to  reliably  detect  these  lines 
individually.  Another  challenge  involves  character  strokes,  which  may  lie  on  the  same 
line,  causing  a  high  false  alarm  rate.  If  the  false  alarm  lines  are  removed,  character  strokes 
may  be  removed  erroneously.  A  line  detection  algorithm  with  both  a  high  accuracy  and 
a  low  false  alarm  rate  should  be  developed  for  severely  broken  rule  lines. 

To  handle  these  problems,  the  context  is  often  required  to  refine  the  initial  detec¬ 
tion.  For  example,  in  form  processing  most  form  cells  are  rectangular.  In  known  form 
processing  the  number  of  lines  and  the  gaps  among  these  lines  can  be  used  as  a  priori 
knowledge  and  stored  as  references  in  form  templates.  These  ideas  have  been  presented 
in  previous  work  to  improve  detection  accuracy  and  reduce  false  alarms  [11-14],  But  the 
usage  of  a  priori  knowledge  in  the  above  applications  is  ad  hoc  and  lacks  a  systematic 
representation. 

In  this  dissertation,  we  present  a  rule  line  detection  algorithm  based  on  hidden  Markov 
model  (HMM)  decoding.  After  skew  estimation  and  correction,  we  perform  a  horizontal 
projection.  An  HMM  model  is  used  to  model  the  projection  profile,  and  the  positions  of 
all  rule  lines  are  detected  simultaneously  after  HMM  decoding  with  the  Viterbi  algorithm. 
Experiments  on  a  real  data  set  show  that  our  algorithm  achieves  both  a  high  detection 
accuracy  and  a  low  false  alarm  rate.  After  detection,  lines  are  removed  by  line  width 
thresholding. 
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1.2  Handwriting  Identification  in  Noisy  Documents 

Handwriting  often  mixes  with  machine  printed  text.  Handwriting  in  a  machine  printed 
document  often  indicates  corrections,  additions,  or  other  supplemental  information  that 
should  be  treated  differently  from  the  main  content.  The  segmentation  and  recognition 
techniques  required  for  machine  printed  and  handwritten  text  differ  significantly.  There¬ 
fore,  identification  of  handwriting  from  machine  printed  text  is  crucial  for  the  following 
document  image  analysis. 

The  data  set  we  are  processing  is  noisy,  which  makes  the  problem  more  challenging. 
Large  (e.g.,  marginal  black  strips)  and  small  noise  components  (e.g.,  pepper-and-salt 
noise)  can  be  removed  reliably  with  some  simple  rules  [15,  16].  It  is,  however,  hard 
to  discriminate  noise  from  compatible  sized  text.  In  this  dissertation,  we  treat  noise 
as  a  distinguished  class.  We  first  segment  the  document  at  a  suitable  level,  and  each 
segmented  block  is  classified  into  machine  printed  text,  handwriting,  or  noise. 

Some  work  has  been  done  on  handwriting/machine  printed  text  identification.  The 
classification  is  typically  performed  at  the  text  line  [17-20],  word  [21],  or  character 
level  [22,23].  Special  consideration  must  be  given  to  the  size  of  the  region  being  seg¬ 
mented  before  we  can  perform  any  classification.  The  smallest  unit  used  for  classification 
is  called  the  pattern  unit.  If  the  unit  is  too  small,  the  information  contained  in  it  may  not 
be  sufficient  for  classification;  if  it  is  too  large,  however,  different  types  of  components 
may  be  mixed  in  the  same  region  due  to  segmentation  errors.  In  previous  work,  we  con¬ 
ducted  a  performance  evaluation  for  the  accuracy  of  machine  printed  text/handwriting 
distinguish  at  the  character,  word,  and  zone  levels,  and  showed  that  a  reliable  classifica¬ 
tion  can  be  achieved  at  the  word  level  [23].  Several  features,  such  as  Gabor  filter  features, 
run-length  histogram  features,  crossing  counts  histogram  features,  and  texture  features, 
are  extracted  to  identify  each  segmented  block  into  machine  printed  text,  handwriting, 
or  noise. 

Several  classifiers,  such  as  the  Fisher  linear  discriminant  classifier,  the  k- nearest  neigh¬ 
bor  (k-NN)  classifier,  and  the  support  vector  machine  (SVM)  classifier,  are  tested  in  our 
comparison  experiments.  They  have  similar  performance  with  reasonable  accuracy.  If 
the  machine  printed  text  block  is  too  small  (such  as  words  with  less  than  3  characters), 
it  is  likely  to  be  classified  as  noise.  Some  noise  blocks  are  classified  as  handwriting  due  to 
the  overlapping  in  the  feature  space  of  these  two  classes.  Machine  printed  text,  handwrit¬ 
ing,  and  noise  exhibit  different  patterns  of  geometric  relationships.  For  example,  printed 
words  often  form  horizontal  (or  vertical)  text  lines,  while  noise  blocks  tend  to  overlap 
each  other.  Markov  random  field  (MRF)  is  used  to  model  such  geometric  relationships  to 
refine  the  classification  results.  Experiments  show  MRF  is  effective  in  modeling  the  ge¬ 
ometric  dependency  of  neighboring  components,  about  half  of  the  mis-classifications  are 
corrected  after  post-processing.  After  identification,  noise  can  be  removed  from  the  doc¬ 
ument,  which  enhances  a  degraded  document.  Machine  printed  text  can  be  sent  for  zone 
segmentation  or  recognition  with  any  off-the-shelf  OCR  package.  Identified  handwriting 
can  be  sent  for  further  analysis,  such  as  recognition,  retrieval,  signature  verification  or 
identification. 
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1.3  Handwriting  Matching,  Synthesis,  and  Retrieval 

The  identified  handwriting  may  be  sent  for  further  analysis.  In  this  dissertation,  we 
propose  a  novel  handwriting  matching  technique  and  apply  it  for  handwriting  synthesis 
and  retrieval.  We  study  handwriting  matching  in  a  broader  context  of  nonrigid  shape 
matching  using  a  set  of  points  uniformly  sampled  from  the  handwriting  skeleton.  For 
nonrigid  shapes,  most  neighboring  points  cannot  move  independently  under  deformation 
due  to  physical  constraints.  Therefore,  though  the  absolute  distance  between  two  points 
may  change  significantly,  the  neighborhood  of  a  point  is  well  preserved  in  general.  Based 
on  this  observation,  we  formulate  point  matching  as  an  optimization  problem  to  preserve 
local  neighborhood  structures  during  matching.  Our  formulation  has  a  simple  graph 
matching  interpretation,  where  each  point  is  a  node  in  the  graph,  and  two  nodes  are 
connected  by  an  edge  if  they  are  neighbors.  The  optimal  match  between  two  graphs  is 
the  one  that  maximizes  the  number  of  matched  edges  (i.e.,  the  number  of  neighborhood 
relations).  The  shape  context  distance  is  used  to  initialize  the  graph  matching,  followed 
by  relaxation  labeling  for  refinement.  Experiments  demonstrate  the  effectiveness  of  our 
approach:  it  outperforms  the  shape  context  [24]  and  TPS-RPM  [25]  algorithms  under 
nonrigid  deformation  and  noise  on  a  public  data  set. 

The  performance  of  a  statistical  pattern  recognition  system  depends  heavily  on  the 
size  and  quality  of  the  training  set.  Although  it  is  easy  to  prepare  samples  of  machine 
printed  text,  doing  so  is  expensive  for  handwriting.  Synthesized  data  can  be  used  as  a 
supplement.  The  key  problem  of  handwriting  synthesis  is  generating  samples  that  look 
natural.  Otherwise,  arbitrarily  synthesized  samples  cannot  improve  (if  not  deteriorate) 
the  performance  of  the  system  trained  on  them.  Although  handwriting  samples  vary 
greatly  in  respect  to  size,  rotation,  and  stroke  width,  shape  is  generally  used  to  categorize 
them  into  different  classes.  Since  nonrigid  deformation  of  handwriting  is  large,  we  argue 
that  a  synthesis  algorithm  should  learn  the  shape  deformation  characteristics  from  real 
handwriting  samples.  It  is  reasonable  to  assume  that  the  shape  space  of  handwriting 
with  the  same  content  (e.g.,  the  handwriting  samples  of  the  letter  ‘a’)  is  continuous.  For 
characters  with  several  different  writing  glyphs,  such  as  number  ‘7,’  we  may  need  to  do 
clustering  analysis  to  segment  the  shape  space  into  multiple  continuous  sub-space.  Given 
two  handwriting  samples  close  in  the  shape  space,  an  interpolation  between  them  is  likely 
to  lie  inside  the  shape  space  too  (this  is  guaranteed  if  the  shape  space  is  convex).  That 
means,  given  two  real  similar  handwriting  samples,  it  is  reasonable  to  assume  some  person 
may  write  with  a  shape  between  them  (i.e.,  with  similar  but  less  degree  of  deformation). 
In  this  dissertation,  we  propose  an  example-based  handwriting  synthesis  approach  using 
two  training  samples.  We  use  our  handwriting  matching  algorithm  to  establish  the 
correspondence  between  two  handwriting  samples.  After  handwriting  matching,  we  warp 
one  sample  toward  the  other  using  the  thin  plate  spline  (TPS)  deformation  model.  By 
adjusting  the  regularization  parameter  of  the  TPS  deformation  model,  we  can  control 
the  amount  of  nonrigid  deformation  in  synthesis. 

Another  application  of  handwriting  matching  is  handwriting  retrieval.  Recently, 
shape  context  [24]  was  proposed  as  an  effective  tool  for  shape  recognition  and  retrieval. 
In  this  approach,  the  point  correspondence  is  estimated,  and  similarity  measures  are  de¬ 
fined  based  on  the  matching  result  for  shape  retrieval.  By  replacing  the  original  shape 
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matching  method  with  our  more  robust  approach,  we  achieve  moderate  improvement. 
To  further  improve  the  accuracy,  we  propose  new  similarity  measures,  such  as  a  mea¬ 
sure  based  on  the  affine  transformation,  registration  residual  errors,  and  outlier  ratio 
estimated  by  the  matching  algorithm.  Using  more  similarity  measures  will  significantly 
improve  the  retrieval  accuracy.  A  more  effective  way  to  improve  accuracy  is  to  use  multi¬ 
ple  query  samples.  We  propose  a  simple  but  effective  way  to  combine  the  retrieval  results 
using  multiple  query  instances. 

1.4  Organization  of  the  Dissertation 

This  dissertation  is  organized  as  following:  Rule  line  detection  and  removal  is  described 
in  detail  in  the  next  chapter.  Our  model-based  line  detection  algorithm  is  not  limited  to 
handwriting  document  analysis.  In  Chapter  3,  we  apply  it  for  known  form  processing. 
In  Chapter  4,  we  present  our  approach  to  identify  handwriting  and  machine  printed  text 
in  noisy  document  images.  Our  handwriting  matching  approach  is  described  in  Chap¬ 
ter  5.  We  discuss  the  application  of  handwriting  matching  to  handwriting  synthesis  and 
handwriting  retrieval  in  Chapters  6  and  7,  respectively.  This  dissertation  concludes  with 
a  brief  summary  of  our  contributions,  and  some  discussions  of  the  remaining  problems. 


2  Rule  Line  Detection 


2.1  Introduction 

Many  handwritten  documents  come  with  a  special  background  pattern,  rule  lines,  as 
shown  in  Fig.  2a.  It  is  important  that  these  lines  are  detected  and  removed  before  the 
text  goes  to  an  optical  character  recognition  (OCR)  engine.  In  this  chapter,  we  focus  on 
rule  line  detection. 

Many  line  detection  algorithms  have  been  proposed  in  the  literature  [6-11].  They  work 
well  on  relatively  clean  documents  with  solid  or  mildly  broken  lines,  but  the  performance 
will  deteriorate  significantly  if  lines  are  severely  broken  due  to  the  low  image  quality,  or 
if  they  mix,  touch,  and  overlap  with  text.  Fig.  2b  shows  the  line  detection  results  for  a 
rule-lined  document  using  the  directional  singly-connected  chain  (DSCC)  method  [11]. 
We  can  see  only  a  few  lines  are  partially  detected  due  to  severe  brokenness.  It  is  very 
difficult,  if  not  impossible,  to  reliably  detect  these  lines  individually. 

In  this  dissertation,  we  propose  a  model-based  method  which  incorporates  context  to 
detect  parallel  lines  optimally  and  systematically.  Under  the  model,  lines  are  detected  by 
a  hidden  Markov  model  (HMM)  decoding  process,  which  can  determine  the  positions  of 
all  lines  simultaneously.  Rather  than  detecting  lines  directly  on  original  images  [7,10,11], 
we  use  a  DSCC-based  scheme  to  filter  text  as  a  preprocessing  step  so  the  interference 
with  text  can  be  minimized.  We  then  use  a  coarse-to-fine  approach  to  estimate  the  skew 
angle  of  the  document.  After  deskewing  the  document,  we  perform  horizontal  vertical 
projection.  Rather  than  treating  the  peaks  in  the  projection  profile  as  the  positions  [7,10], 
we  model  the  projection  profile  with  an  HMM  model  so  the  context  among  these  lines 
can  be  incorporated.  The  Viterbi  algorithm  is  then  used  to  search  the  optimal  positions 
of  these  lines  simultaneously  from  the  projection  profile.  The  experimental  results  show 
our  method  is  robust.  It  can  detect  lines  with  a  high  accuracy  and  a  low  false  alarm 
rate  in  degraded  documents.  Fig.  2c  shows  the  line  detection  result  using  the  proposed 
model-based  approach.  Compared  to  Fig.  2b,  our  detection  result  is  much  better.  Our 
model-based  parallel  line  detection  algorithm  is  flexible;  therefore,  it  can  be  easily  adapted 
for  different  applications.  We  will  demonstrate  it  on  two  applications:  rule  line  detection 
and  known  form  processing.  In  this  chapter  we  focus  on  rule  line  detection,  and  the 
application  on  known  form  processing  will  be  discussed  in  the  next  chapter. 

After  line  detection,  we  would  like  to  remove  the  detected  lines  without  deteriorating 
the  text.  Many  line  removal  algorithms  have  been  developed  in  the  literature,  and  can 
be  classified  into  two  categories.  One  kind  of  approach  tries  to  remove  lines  completely, 
then  uses  local  property  of  overlapping  areas,  such  as  stroke  direction  and  connection,  to 
restore  the  missing  parts  of  strokes  [8,26-28].  One  problem  of  these  approaches  is  that, 
after  line  removal,  a  large  quantity  of  useful  information  is  lost,  making  stroke  recovery 
difficult.  The  other  kind  of  approach  analyzes  character-line  overlapping  areas,  then 
removes  pixels  only  belonging  to  lines,  while  preserving  those  belonging  to  characters  [29- 
32],  During  my  graduate  work  at  Tsinghua  University  in  China,  I  proposed  a  line  width 
thresholding  based  approach  [33,34],  The  line  width  is  preserved  well  if  no  character¬ 
line  touching  happens,  but  increases  noticeably  in  overlapping  areas.  This  property  is 
used  for  line  removal.  We  first  decompose  a  line  to  an  array  of  run-lengths  of  black 
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Figure  2:  Rule  line  detection  and  removal,  (a)  A  rule-lined  document;  (b)  Line  detection 
results  using  the  DSCC  method  [11];  (c)  Line  detection  results  using  our  model  based 
approach;  (d)  Cleaned  image  after  line  removal  (the  black  marginal  strips  are  removed 
too). 

pixels,  predicated  to  the  running  direction  of  the  line.  If  a  run-length  is  shorter  than  a 
threshold,  it  is  removed;  otherwise,  it  is  preserved.  An  adaptive  scheme  is  used  to  set 
a  small  threshold  for  the  area  close  to  text,  and  a  large  one  for  the  area  far  apart  from 
text.  Fig.  2d  shows  the  line  removal  result.  As  we  can  see,  this  approach  works  well 
for  most  character- line  touching  cases.  In  this  chapter,  we  focus  on  rule  line  detection, 
please  refer  to  our  publication  [33]  for  details  of  our  line  removal  approach. 

The  remainder  of  this  chapter  is  organized  as  follows.  In  Section  2.2,  we  briefly  review 
previous  work  on  line  detection  with  emphasis  on  the  usage  of  prior  knowledge.  Since 
text  may  touch  or  overlap  with  lines,  removing  text  before  line  detection  will  significantly 
increase  the  robustness  of  our  line  detection  algorithm.  In  Section  2.3,  we  present  our  text 
filtering  method.  Our  general  model-based  parallel  line  detection  algorithm  is  described 
in  detail  in  Section  2.4,  which  can  be  tailored  for  rule  line  detection  (Section  2.5)  or 
known  form  processing  (discussed  in  the  next  chapter).  We  demonstrate  the  robustness 
of  our  approach  with  experiments  in  Section  2.6,  and  the  chapter  concludes  in  Section  2.7 
with  a  discussion  of  future  work. 

2.2  Related  Work  on  Line  Detection 

Line  detection  is  widely  used  in  form  detection  and  interpretation  [8,10,11],  engineer¬ 
ing  graph  interpretation  [35],  bank  check/invoice  processing  [13,14],  and  optical  music 
recognition  (OMR)  [36].  Among  many  algorithms  proposed  in  the  literature,  the  Hough 
transform  method  and  its  variations  are  widely  used  [6,37].  The  Hough  transform  method 
converts  the  global  pattern  detection  problem  in  an  image  space  to  a  local  pattern  (ide¬ 
ally  a  point)  detection  problem  in  a  transformed  parameter  space.  To  detect  a  straight 
line,  each  black  pixel  ( x,y )  in  an  image  space  is  transformed  into  a  sinusoidal  curve  in 
the  Hough  parameter  space 

p  =  xcos9  +  ysind.  (1) 

After  transformation,  collincar  points  (;Cj,  y.j)  in  the  image  space  intersect  at  a  point  (p,  9) 
in  the  Hough  parameter  space.  Therefore,  a  peak  in  the  transformed  space  provides  strong 
evidence  that  a  corresponding  straight  line  exists  in  the  image.  The  Hough  transform 
method  can  detect  dashed  and  mildly  broken  lines.  However,  it  is  very  time  consuming. 
To  reduce  computational  costs,  a  projection  method  was  proposed  [7]  to  detect  form 


10 


frame  lines  by  limiting  the  search  orientations  since  only  horizontal  and/or  vertical  lines 
usually  exist  in  form  documents.  The  method  deskews  the  document  first,  then  detects 
the  peaks  on  the  horizontal  and  vertical  projection  profiles  as  lines.  It  can  be  viewed  as 
a  special  case  of  the  Hough  transform  method  by  searching  9  only  around  0°  and  90°. 
The  method  will  fail  if  the  projection  of  a  line  does  not  form  a  peak  on  the  profile  when 
it  mixes  with  text,  the  estimated  skew  angle  is  not  accurate  enough,  or  the  lines  are  too 
short  or  severely  broken.  Chen  and  Lee  [10]  proposed  the  strip  projection  method  to 
alleviate  this  problem  since  lines  are  more  likely  to  form  peaks  on  the  projection  profile 
in  a  small  region.  For  horizontal  line  detection,  they  first  divided  an  image  into  several 
vertical  strips  of  equal  width,  and  then  performed  horizontal  projection  in  each  strip. 
The  detected  collinear  line  segments  in  each  strip  are  linked  to  form  the  line. 

Thinning  is  another  common  method  to  extract  lines.  It  uses  an  iterative  boundary 
erosion  process  to  remove  outer  pixels  until  only  a  skeleton  of  pixel  chains  remains  [38]. 
It  can  maintain  connectivity,  but  also  tends  to  create  noisy  junctions  at  corners,  inter¬ 
sections,  and  branches.  Medial  line  methods,  on  the  other  hand,  extract  image  contours 
first.  The  mid-points  of  two  parallel  contour  lines  then  form  a  medial  line  [39].  The 
methods  may  miss  pairs  of  contour  lines  at  branches,  requiring  post-processing  to  reduce 
this  distortion  [40].  The  result  of  either  thinning  or  medial  line  method  is  a  chain  of 
pixels,  and  a  line  segment  can  be  detected  by  approximating  the  pixel  chain.  The  sparse 
pixel  vectorization  (SPV)  algorithm,  proposed  by  Dori  et  al.  [9],  does  not  use  contours 
to  get  medial  lines.  It  traces  the  medial  points  of  consecutive  horizontal  or  vertical  pixel 
runs  until  constraints  are  violated.  Each  continuous  trace  represents  a  bar  or  an  arc. 
SPV  often  achieves  better  results  than  other  medial  line  methods,  but  the  medial  point 
tracking  procedure  is  complicated,  and  often  needs  post-processing  to  refine  the  results. 

Run-lengths  are  often  used  as  an  image  component  to  detect  lines.  Yu  and  Jain  [8] 
proposed  a  data  structure,  called  block  adjacency  graph  (BAG),  to  represent  an  image. 
BAG  is  defined  as  Q(N,E),  where  N  is  a  set  of  block  nodes  and  E  is  a  set  of  edges 
indicating  the  connection  between  two  nodes.  Each  node  is  a  block  which  contains 
either  one  or  several  horizontal  run-lengths  adjacently  connected  in  the  vertical  direction 
and  aligned  on  both  left  and  right  sides  within  a  given  tolerance.  A  line  is  detected 
by  searching  a  connected  sub-graph  in  the  BAG  with  large  aspect  ratio.  Chhabar  et 
al.  [41]  presented  another  run-length-based  approach  for  horizontal  line  detection.  Since 
the  method  is  composed  of  four  steps:  filter,  assemble,  silhouette,  and  threshold,  they 
named  it  the  FAST  algorithm.  The  algorithm  works  directly  on  run-length  encoded 
images  and  is  very  fast.  It  was  later  extended  to  detect  lines  with  any  orientation  after 
implementing  an  efficient  rotation  operation  on  run-length  coded  images  [42],  Recently, 
the  directional  singly-connected  chain  (DSCC)  method  was  proposed  [11].  A  DSCC  is  a 
chain  of  run-lengths  which  are  singly  connected.  A  basic  characteristic  of  a  line  is  that 
it  runs  in  only  one  direction.  Run-lengths  perpendicular  to  the  direction  of  a  line  are 
merged  into  a  DSCC.  When  a  junction  is  encountered,  the  merging  process  stops  and 
a  new  DSCC  generates.  Each  DSCC  represents  a  line  segment,  and  multiple  collinear 
DSCCs  may  be  merged  into  a  line,  based  on  pre-defined  rules.  In  the  above  approaches, 
the  grouping  of  run-length  into  line  segments  is  rule-based.  A  model-based  method, 
using  the  Kalman  filter,  was  proposed  [43].  Assuming  that  a  run-length  (perpendicular 
to  the  line’s  running  direction)  of  constant  length  moves  along  a  straight  line,  the  Kalman 
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filtering  technique  is  used  to  track  the  run-length.  If  the  tracking  error  is  larger  than  a 
threshold,  it  is  stopped,  and  a  new  tracking  begins. 

In  some  applications,  horizontal  and  vertical  lines  always  intersect  each  other.  This 
property  can  be  used  to  develop  an  efficient  algorithm  by  detecting  intersections  of  hor¬ 
izontal  and  vertical  lines  first.  The  verification  of  line  segments  between  intersections 
complete  the  algorithm  [42,44], 

In  most  of  the  above  approaches,  domain  specific  knowledge  is  used  implicitly  or 
explicitly.  For  example,  parameters  of  a  line  detection  algorithm  may  be  tuned  to  a 
specified  application.  In  engineering  drawing  interpretation,  knowing  the  line  type  (such 
as  solid,  dotted,  or  dashed)  helps  to  develop  a  robust  line  detection  algorithm  [45].  In 
some  forms,  most  cells  are  rectangular.  This  knowledge  can  be  used  to  improve  detection 
accuracy  and  reduce  false  alarms  [11].  In  [46],  Roach  and  Tatem  demonstrated  the 
effectiveness  of  domain  specific  knowledge  in  a  highly  structured  domain:  handwritten 
music  score  recognition.  But  the  use  of  the  prior  knowledge  in  above  applications  is  ad 
hoc  and  lacks  systematic  representation. 

2.3  Preprocessing 

Preprocessing  has  two  purposes:  first,  we  deskew  the  document  so  the  parallel  lines 
are  oriented  horizontally  or  vertically;  second,  we  filter  text  strokes  to  diminish  their 
intervention  in  line  detection.  The  skew  of  a  document  can  be  estimated  using  the 
text  [47],  or  using  the  extracted  line  segments  if  lines  are  available  on  the  document  [48]. 
In  our  approach,  we  use  a  coarse-to-ffire  line  based  skew  estimation  method,  which  is 
similar  to  [48] .  Since  skew  estimation  is  a  mature  technique  in  document  image  analysis, 
we  will  not  discuss  the  details  in  this  section.  After  skew  estimation,  we  can  easily  rotate 
the  document  to  correct  the  skew.  In  this  section,  we  focus  on  text  filtering,  which  is  one 
of  our  contributions.  We  extract  directional  singly-connected  chains  (DSCC)  first,  then 
remove  DSCCs  unlikely  to  be  generated  by  a  line  segment  because  of  their  shapes. 


2.3.1  Definition  of  DSCC 

We  define  two  types  of  DSCCs:  horizontal  and  vertical,  as  described  in  [11],  Take  the 
horizontal  DSCC  for  example.  A  horizontal  DSCC,  C/,,  consists  of  a  black  pixel  run- 
length  array  R\R-2  •  •  •  Rm,  where  R{  is  a  vertical  run-length  with  one  pixel  width 


Ri(xi,ysi,yei ) 


p{x,  V )  =  1,  for  x  =  Xi,y  e  [ysh  yet \ 
and  p(xi,  ysi  -  1)  =  p(xh  yei  +  1)  =  0 


(2) 


where  p(x,  y)  is  the  value  of  pixel  (x,  y)  with  1  representing  black  pixels,  and  0  repre¬ 
senting  white  pixels;  x*,  ys^  and  ye j  designate  x,  starting  y,  and  ending  y  coordinates 
of  Ri,  respectively.  Two  neighboring  run-lengths  R,  and  Ri+ 1  are  merged  into  a  DSCC 
if  they  are  singly  connected  in  the  horizontal  direction.  As  shown  in  Fig.  3a,  the  single 
connection  means  that  at  each  side  of  R,t(l  <  i  <  m),  there  is  one  and  only  one  con¬ 
nected  run-length.  In  this  example,  R1R2  ■  ■  ■  R7,  R11R12R13,  Rs,  R9,  Rio ,  Ru  and  R15 
are  extracted  as  DSCCs.  The  definition  of  the  vertical  singly-connected  chain,  Cv,  is 
similar. 
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Figure  3:  Definition  and  extraction  of  horizontal  DSCCs.  (a)  Illustration  of  horizontal 
DSCCs.  (b)  Extracted  DSCCs  (represented  in  gray)  where  a  text  stroke  crosses  a  line. 

The  most  important  property  of  a  line  is  the  single  connection  along  its  running 
direction.  An  ideal  line  consists  of  only  one  DSCC.  A  real  line  often  consists  of  multiple 
collinear  DSCCs.  Fig.  3b  shows  an  example  of  extracted  DSCCs  (represented  in  gray) 
of  a  text  stroke  crossing  a  line.  We  can  see  the  line  breaks  into  several  line  segments 
(DSCCs)  on  the  touching  area.  If  the  image  quality  is  reasonable,  then  a  line  can  be 
detected  by  merging  DSCCs  with  similar  orientation  [11],  In  onr  case,  we  use  it  to  remove 
text  and  preserve  line  segments. 

2.3.2  Text  Filtering 

As  shown  in  Fig.  3b,  a  DSCC  can  be  a  text  stroke  or  a  line  segment.  We  observed  that  a 
line  segment  often  has  a  smaller  variation  from  the  desired  orientation  and  larger  aspect 
ratio.  We  use  an  ellipse  to  model  the  shape  of  a  DSCC,  and  calculate  the  orientation  9 , 
the  first  and  second  axes  a  and  b  of  each  DSCC  as  follows: 

Vmn  =  -  x)m(y  -  y)np(x,  y)  (3) 


x  y 


where  p(x,  y )  represents  a  pixel  in  the  DSCC,  x  and  y  are  the  means  of  x  and  y  coor¬ 
dinates,  and  umn  is  a  central  moment.  For  horizontal  line  detection,  we  only  preserve 
those  DSCCs  with  either  very  small  sizes  (max{a,6}  <  Tf)  or  large  aspect  ratios  within 
a  specified  orientation  ( a/b  >  T2  and  6  G  [—45°, 45°]).  Tf  and  T2  are  thresholds  deter¬ 
mined  experimentally.  The  first  condition  preserves  small  DSCCs,  which  may  be  parts 
of  a  broken  line  or  the  touching  areas  of  lines  and  text;  and  the  second  preserves  large 
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Figure  4:  DSCC-based  text  filtering,  (a)  and  (b)  A  document  image  with  rule  lines  and 
the  corresponding  result  of  text  filtering  in  the  horizontal  direction,  (c)  A  form  document 
image,  (d)  and  (e)  Results  of  text  filtering  in  the  horizontal  and  vertical  directions  of 
(c),  respectively. 

DSCCs,  which  are  likely  to  be  horizontal  line  segments.  For  rule  line  detection,  we  need 
to  perform  text  filtering  only  in  the  horizontal  direction.  Our  approach  can  be  extended 
for  vertical  line  detection,  as  discussed  in  the  next  chapter  on  known  form  processing. 
For  vertical  line  detection,  similar  filtering  conditions  exist  except  for  the  orientation. 
Fig.  4  shows  examples  of  text  filtering.  We  can  see  that  most  text  strokes  are  filtered 
and  the  line  segments  are  preserved. 

2.4  HMM-Based  Parallel  Line  Detection 

In  the  following  description,  we  use  horizontal  line  detection  as  an  example  to  illustrate 
the  proposed  method.  The  extension  to  vertical  line  detection  is  straightforward.  After 
skew  correction  and  text  filtering,  we  perform  a  horizontal  projection  and  detect  lines 
on  the  projection  profile.  A  stochastic  model,  M(yi,y2,  ■  ■  ■ ,  y n ),  is  proposed  for  a  group 
of  parallel  lines,  where  N  is  the  number  of  lines,  and  y^i  =  1,2,...,  iV,  is  the  vertical 
position  of  the  ith  line  on  the  projection  profile.  The  line  gap  gt  between  two  neighboring 
lines  i  and  i  +  1  is  defined  as 

9i  =  Vi+ 1  -  Vi-  (7) 

A  global  image  registration  method  (such  as  affine  transformation  or  projective  trans¬ 
formation)  cannot  compensate  for  local  distortions  introduced  in  photocopying  and  scan¬ 
ning.  Such  local  distortions  will  introduce  variations  to  the  vertical  line  positions  y^s 
on  the  horizontal  projection  profile.  Kanungo  and  Haralick  [49]  found  that  the  variation 
of  the  position  of  a  point  is  as  large  as  four  pixels  after  removing  the  global  projective 
deformation.  Therefore,  the  variation  of  the  distance  between  two  points  will  be  within 
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the  range  [-8,  +8]  pixels,  if  the  variations  of  two  points  are  independent.  Considering  the 
case  that  documents  may  be  bent,  folded  and  un-folded,  or  they  may  be  stored  in  various 
environmental  conditions  (e.g.,  hot,  cold,  dry  or  humid)  for  years,  the  local  distortions 
in  scanned  images  may  be  larger.  In  our  experiments,  we  found  the  maximum  variation 
of  r/,  from  its  mean  value  can  reach  up  to  11  pixels.  It  is  hard  to  model  the  dependency 
among  the  variations  of  g):’s.  As  a  simplification,  in  our  approach,  we  do  not  consider 
such  dependency.  Then,  it  is  easy  to  show  that  the  line  positions  y.,’s  form  a  Markov 
chain  under  this  simplified  assumption.  As  they  are  not  observable  directly,  an  HMM  is 
more  suitable  for  modeling  the  projection  profile.  The  line  positions  can  be  detected  by 
decoding  the  HMM. 

2.4.1  Hidden  Markov  Model 

The  Markov  property  of  a  sequence  of  events  is  well  studied  in  the  literature  [50].  Consider 
a  system  that  stays  at  one  of  a  set  of  N  distinct  states,  Si,  S2, . . . ,  Sn,  at  any  sampling 
time  t.  The  system  undergoes  a  change  of  state  according  to  a  set  of  probabilities 
associated  with  the  state  during  the  period  between  two  successive  sampling  times.  For 
a  Markov  chain  (the  first  order),  the  probability  distribution  of  qt  only  depends  on  the 
value  of  the  previous  state  qt_ \ 

P[<k  =  Skt\qt- 1  =  Skt_1,  qt-2  =  Skt_  2,  ...,qi  =  Sfcl]  =  P[qt  =  Skt\qt- 1  =  S^]  (8) 

If  the  state  transition  probability  is  independent  of  time  t,  then  the  Markov  chain  is  said 
to  be  homogeneous 

P[qt  =  Sj\qt-i  =  Si\  =  dij  1  <i,j<N  (9) 

We  can  show  that  line  positions  {Yj,i  =  1,2,...,  iV}  form  a  Markov  chain,  if  the 
variations  in  line  gaps  are  independent.  We  use  uppercase  characters  to  represent  random 
variables  (e.g.,  Yj)  and  lowercase  characters  to  represent  the  value  of  the  random  variables 

(e-g.,  Vi). 

Theorem  1:  Let  Yt,  i  =  1,2,...,  At  be  line  positions,  and  Gi  =  YJ+i  —  Yj,i  = 
1, . . . ,  At  —  1  be  line  gaps.  If  {Gi}  are  independent,  then  {Yj}  form  a  Markov  chain. 

P(Yi |Yj,  Y'2, . . . ,  Yi_i)  =  P(Yi\Yi-i)  (10) 

Proof: 

P{Yi\Yi,Y2,...,Yi_i)  =  P(Gi-i  +  Yi_i\Yi,Y2,..,,Yi_i) 

=  P(Gi_i\Yi,  Y^, ... ,  Yi_i) 

—  P(Gi_i\Gi,  G2, . . . ,  Gj-2,  Yi_i)  (11) 


Since,  {Gj}  are  independent,  we  have 

P(Xi\Y\,Y2, . . . ,  Yi_i)  =  P(Gi_1|yi_1) 

=  PiGi-i  +  Yi^Y^) 

=  p(y|y-i)  (12) 
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Therefore,  {Yj}  form  a  Markov  chain. 

In  the  literature  of  random  process  [51],  {Yj}  is  called  an  independent  increment  pro¬ 
cess,  which  includes  several  well-known  random  processes,  such  as  the  Brownian  motion 
process  and  the  Poisson  process. 

In  many  applications,  the  actual  state  sequence  is  not  observable.  The  resulting  model 
(which  is  called  a  hidden  Markov  model)  is  a  doubly  embedded  stochastic  process  with 
an  underlying  stochastic  process  that  is  not  observable,  but  can  be  inferred  only  through 
another  stochastic  process  that  produces  the  sequence  of  observations.  The  elements  of 
a  standard  discrete  HMM  are 

1)  N ,  the  number  of  the  states  in  the  model. 

2)  M,  the  number  of  distinct  observation  symbols  per  state. 

3)  A  =  {ciij},  the  state  transition  probability  matrix. 

4)  B  =  {bij},  the  probability  distribution  matrix  of  the  observation  symbols. 

5)  7 r,  the  initial  state  distribution. 

HMMs  can  model  some  1-D  signals  well,  and  have  achieved  great  success  in  speech  [50] 
and  handwriting  recognition  [3]. 

In  our  application,  we  can  observe  only  the  projection  profile  h k 


P(Hk  —  hk  |Yi  —  yi, . . .  ,YN  —  yN ) 


P(Hk  =  k  —  A  line  is  on  k 

P(Hk  =  k  ^  yi)  No  lines  are  on  k 

(13) 


Therefore,  the  projection  profile  can  be  modeled  with  an  HMM.  A  standard  HMM  is 
shown  in  Fig.  5a,  where  St  and  Sb  are  the  states  representing  top  and  bottom  image 
borders,  Sl,u  i  —  1,  2, . . . ,  N,  represents  lines,  and  Sg,u  i  —  1,  2, . . . ,  N  —  1,  represents  the 
gaps  between  lines  i  and  i  +  1. 

One  weakness  of  conventional  HMMs  is  modeling  of  the  state  duration.  The  inherent 
duration  probability  distribution  Pi(d),  d  =  1,2, ,  associated  with  state  Sci  is 


Pi(d)  =  ( au)d  X(1  -  ait) 


(14) 


where  an  is  a  self  transition  probability.  The  exponential  state  duration  distribution 
is  inappropriate  for  our  applications.  Instead  we  explicitly  model  the  duration  dis¬ 
tributions.  The  model  with  explicit  state  duration  is  shown  in  Fig.  5b  \  where  the 
stochastic  property  of  the  model  is  incorporated  into  the  state  duration  distributions 
Pr(d),  Psid),  Pi(d),i  =  1,  2, . . . ,  A^  —  1.  For  some  applications,  the  quality  of  the  model¬ 
ing  is  significantly  improved  when  explicit  state  duration  distributions  are  used  [52], 


2.4.2  HMM  Parameter  Estimation 

The  major  drawback  of  an  explicit  duration  HMM  is  that  it  significantly  increases  compu¬ 
tational  costs  for  model  training.  With  a  traditional  forward-backward  training  algorithm 
(a  type  of  EM  algorithm),  the  re-estimation  problem  for  a  variable  duration  HMM  is  more 

Ht  only  exactly  models  lines  with  one  pixel  width.  To  make  the  model  more  accurate,  alternatively, 
one  can  introduce  probability  of  durations  to  line  states.  Fortunately,  line  width  does  not  vary  too 
much  in  most  applications.  Experiments  shows  that  such  inaccuracy  in  modeling  does  not  deteriorate 
its  performance  noticeably. 
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Figure  5:  HMMs  for  a  projection  profile,  (a)  A  standard  HMM.  (b)  An  HMM  with 
explicit  state  duration. 

difficult  than  that  for  a  standard  HMM  [50].  Fortunately,  in  our  case,  we  can  directly 
derive  the  HMM  parameters  from  ground-truth  since  the  states  explicitly  correspond  to 
image  components.  Therefore,  the  forward-backward  training  algorithm  is  not  needed. 
We  set  duration  probabilities  of  states  St  and  Sb  to  uniform  distribution  within  a  range. 
The  duration  probabilities  of  states  Sc,i,i  =  1,  2, , . . ,  N  —  1,  is  estimated  directly  from 
the  ground-truth. 

The  observation  comes  from  the  projection  profile  hk ■  The  large  number  of  obser¬ 
vation  symbols  would  prevent  us  from  estimating  the  model  parameters  reliably  with 
limited  training  samples.  There  are  two  methods  to  reduce  the  number  of  parameters 
of  the  model.  One  involves  modeling  the  distribution  of  the  observation  as  a  Gaussian 
distribution  [50],  so  only  the  mean  and  variance  of  the  Gaussian  distribution  need  to 
be  estimated.  For  known  form  processing  (discussed  in  the  next  chapter),  we  find  the 
projections  of  a  line  over  multiple  form  instances  can  be  well  modeled  as  a  Gaussian  dis¬ 
tribution.  Another  method  quantizes  the  projection  profile  into  several  levels.  For  rule 
line  detection,  the  image  quality  varies  significantly  among  different  images.  The  distri¬ 
bution  of  the  observations  does  not  follow  a  Gauss  distribution.  Therefore,  we  quantize 
hk  into  K  levels  ( K  =  5  in  our  experiments  for  rule  line  detection).  The  probability  of 
each  level  is  estimated  from  the  ground-truth. 

The  HMM  parameters  estimated  directly  from  the  ground-truthed  data  set  are  not 
optimal  due  to  the  sparseness  of  the  training  data.  For  example,  some  entries  of  the  line 
gap  distribution  do  not  appear  or  appear  only  a  few  times.  Parameter  sharing,  a  technique 
used  in  neural  networks  to  train  the  parameters  with  limited  training  samples  [53,54], 
is  used  in  our  approach.  For  example,  we  let  non-line  states  St,  Sb,  Sq,i,  ■  ■  ■ ,  Sg,n-i 
share  the  same  observation  probability  distributions  since  the  observations  of  these  states 
are  the  same:  the  projections  of  noise  and  remaining  text  strokes  after  filtering.  For 
rule  line  detection,  we  further  combine  all  line  states  into  one  state,  and  all  non-line 
states  into  another  state,  significantly  reducing  the  parameters  of  the  model.  For  line 
gap  distribution  estimation,  we  assume  the  distribution  is  symmetric  around  the  mean 
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value.  Therefore,  data  smoothing  techniques,  originally  proposed  in  natural  language 
processing  [55],  can  be  used 


C\gt  +  k)  =  C\g%  -  k) 


c(a+q+cfe-j)  k  =  12 


(15) 


where  gt  is  the  mean  value  of  line  gap  Gj,  C(k)  is  the  number  of  instances  of  G,  with 
value  k  in  the  training  set,  and  C'(k)  is  the  smoothed  result  after  imposing  symmetric 
regularization.  Finally,  we  set  the  empty  entries  to  the  minimal  value  of  all  non-zero 
entries.  Suppose  the  maximal  variation  of  the  line  gap  G*  is  K.  For  k  G  [—K,K\,  the 
final  smoothed  result  is 


f  C\gt  +  k)  if  C'{§i  +  k)  +  0 

\  minie[-A',A1,C'(5i+i)#o  C'(gt  +  i)  if  C'(g,  +  k)  =  0 


(16) 


C"(k )  can  be  converted  to  probability  by  normalization. 

The  ultimate  goal  of  training  is  to  search  the  optimal  HMM  parameters  to  minimize 
the  line  detection  error.  The  estimated  parameters  from  the  training  data  can  produce 
reasonable  results,  but  they  do  not  minimize  the  pre-defined  line  detection  error  rate. 
Generally,  the  error  criterion  is  a  complex  function  of  the  model  parameters  without 
a  closed-form  representation.  A  direct  searching  algorithm  can  be  used  to  solve  such 
optimization  problems.  In  our  case,  the  simplex  search  method  proposed  by  Nelder  and 
Mead  [56]  is  used  to  minimize  the  detection  error. 


2.4.3  HMM  Decoding 

Given  the  observation  sequence  O  =  h k  —  1, 2, . . . ,  T,  and  the  HMM  A,  we  want  to 
search  an  optimal  state  sequence  Q  =  q\q2  ■  ■  ■  qr  to  maximize  P(Q\0,  A),  which  is  equiva¬ 
lent  to  maximizing  P(Q,  0|A).  Normally,  the  Viterbi  algorithm,  a  dynamic  programming 
method,  is  used  to  decode  HMMs.  A  matrix  v  with  dimension  T  x  (N  +  1)  is  defined 
and  updated  in  the  Viterbi  algorithm,  and 

v(t,  n)  =  max  P[qu  q2, . , . ,  qt  =  SL,n,  hu  h2, . . . ,  ht\\]  (17) 

qi,q2,...,qt-i 

is  the  best  decoding  score  at  time  t,  which  accounts  for  the  first  t  observations  and  ends 
in  state  Sl,u-  The  sequence  gi,  q2) . . . ,  qt- 1  maximizing  the  probability  in  Eq.  (17)  is  the 
best  decoding  result  until  time  t  if  we  decode  state  qt  as  the  nth  line. 

Suppose  the  minimal  and  maximal  state  durations  of  states  Sc,n  are  <5n-  and  5n+,  and 
the  durations  of  St  and  Sb  are  uniformly  distributed  in  [0,  Ax1]  and  [0,5b],  respectively. 
The  complete  procedure  of  decoding  is  stated  as  follows 

1.  Clear  all  entries  of  matrix  v. 

2.  For  1  <  i  <  St,  decode  the  first  i  —  1  observations  as  St  (the  top  image  border) 
and  observation  i  as  Sl,i 

l  izl 

^  —  jP(hi\  qi  =  SLA)Y[P(hj\qj  =  ST),  (18) 
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where  P(hj\qi  =  Sl ,i)  is  the  probability  of  observing  ht  if  the  system  enters  state 
Sl, i  at  time  i;  n}=i  P(hj\Qj  —  St)  is  the  probability  of  observing  the  first  i  —  1 
observations  if  the  system  stays  at  state  St  during  the  time  period  from  1  to  i  —  1; 
and  is  the  probability  of  the  model  staying  at  St  for  i  —  1  consecutive  periods. 

3.  Set  t  —  1. 

4.  For  n  —  1  to  N 

For  j  =  5n_  to  Sn+ 


t+j-i 

v'(t  +  j,n)  =  v(t,  n)Pn (j)P(ht+j \qt+j  =  SL, ,n+i)  P(hk\qk  =  SG,n)  (19) 

k=t+ 1 

v(t  +  j,n)  =  ma x{v(t  +  j,n),  v'(t  +  j,n )}  (20) 

End  loop  of  j 
End  loop  of  n 

Here  Pn(j )  is  the  probability  of  staying  at  state  Sa,n  with  j  consecutive  times; 
P(ht+j\qt+j  =  Sl,h+ i)  is  the  probability  of  observing  observation  ht+j  if  the  system 
enters  Sl,u+ i  at  time  t  +  j,  which  corresponds  to  a  new  line;  and  nttft+i  P(hk\Qk  — 
Sc,n )  is  the  probability  of  observing  sequence  ht+i  to  ht+j~ i  if  the  system  stays 
at  state  Sc,n  during  this  time  period,  which  corresponds  to  a  line  gap.  Eq.  (20) 
updates  the  optimal  partial  detection  result. 

5.  If  t  >  T  —  8b,  decode  the  following  sequence  as  the  bottom  image  border. 


v\T,  N  +  1) 
v(T,N+  1) 


$B  +  1 


jQ  P{hk\qk  —  Sb) 


v(t,N) 

On  ~ hi 

k=t+ 1 

N  +  1),  v'(T.  N  +  1)} 


(21) 

(22) 


6.  If  f  <  T,  then  t  —  t  +  1,  and  go  to  step  4. 


For  each  t,  the  algorithm  remembers  the  best  decoding  path  until  time  t.  After 
decoding, 

v{T,  N  +  1)  =  maxP(Q,0|A)  (23) 

Q 

is  the  probability  of  detecting  lines  given  the  model,  which  can  be  regarded  as  detection 
confidence.  The  sequence  qi,  g2,  •  •  • ,  Q’r  which  achieves  v(T,  N  + 1)  is  the  optimal  decoding 
result. 


2.4.4  Polyline  Representation 

After  identifying  the  vertical  position  of  a  line,  we  then  need  to  detect  the  left  and  right 
end  points  by  grouping  the  broken  line  segments  together.  For  each  detected  line,  those 
DSCCs  within  10  pixels  distance  to  the  detected  line  are  merged  [11]. 
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An  ideal  straight  line  can  be  represented  with  two  parameters  a  and  b  as  y  =  a.xx  +  b. 
Practically,  a  real  line  is  represented  with  points  (xi,  yj),i  —  1,  2, ,  n.  The  parameters 
a  and  b  can  be  estimated  based  on  the  minimum  mean  squared  error  criterion  (MMSE) 

n 

X  =  y,  Xi/n, 

i= 1 
n 

y  =  Y1  &/n> 

ELi  (x-x)(y-y) 

Zhix-xY  ’ 

b  =  y  —  a  x  x.  (24) 

For  most  straight  lines,  this  approximation  is  good  enough.  However,  due  to  the 
distortions  introduced  by  photocopying  and  scanning,  some  lines  are  cursive  and  cannot 
be  well  represented  by  two  end  points.  In  this  case,  a  polyline  representation  is  used  as 
follows: 

1.  Calculate  the  average  approximation  error  of  a  line 

hk  =  \yi-axxi-b\,  (25) 

n 

e  =  ^Syi/n.  (26) 

i—  1 

2.  If  e  is  smaller  than  the  average  line  width  (often  two  to  four  pixels),  keep  it  with 
two  end  points  representation  and  exit. 

3.  Otherwise,  split  the  whole  line  into  two  segments  from  the  middle  and  estimate  the 
line  parameters  a  and  b  for  each  segment  respectively,  as  described  in  Eq.  (24). 

4.  For  each  segment,  go  to  step  1  and  repeat. 

A  polyline  is  described  as  a  sequence  of  vertices  (Pi,  P?, ,  Pm).  Two  or  three  seg¬ 
ments  are  sufficient  to  represent  most  lines  in  our  following  experiments. 

2.5  Application  to  Rule  Line  Detection 

In  this  section  we  use  the  proposed  method  to  detect  severely  broken  rule  lines.  In 
this  application,  the  number  of  lines  is  unknown,  and  the  vertical  line  gaps  may  vary  in 
different  images  due  to  the  different  styles  used  by  rule-lined  paper  or  different  scanning 
resolutions.  However,  the  length  of  lines  and  the  vertical  line  gaps  are  roughly  consistent 
in  the  same  document  image. 


20 


\  Estimated  vertical  line  gap 


Figure  6:  Vertical  line  gap  estimation  for  rule  line  detection,  based  on  the  auto-correlation 
of  the  projection  profile. 


Pi(d) 


Figure  7:  A  simplified  HMM  for  rule  line  detection. 

2.5.1  Vertical  Line  Gap  Estimation 

We  need  to  estimate  the  average  vertical  line  gap  from  the  input  image.  Since  the  line 
gaps  between  neighboring  lines  are  roughly  the  same,  the  horizontal  projection  of  rule 
lines  is  a  periodic  signal  (the  period  is  the  average  vertical  line  gap  g).  We  use  an 
auto-correlation-based  approach  to  estimating  the  period  of  the  projection.  The  auto¬ 
correlation  of  a  signal  x,  with  n  samples  x(l),x(2),  ■  ■  ■  ,x(n),  is  defined  as 

n—l 

R(l)  —  x(i)x(i  +  l)  l  =  0, 1, . . . ,  n  —  1  (27) 

i—  1 

The  distance  between  the  first  two  peaks  of  the  auto-correlation  is  taken  as  the  vertical 
line  gap,  as  shown  in  Fig.  6. 

2.5.2  A  Simplified  Model 

In  order  to  reduce  the  complexity  of  the  model  (the  number  of  states  and  parameters), 
we  further  simplify  it  by  considering  the  special  properties  of  rule  lines.  Since  the  vertical 
line  gaps  and  the  lengths  of  rule  lines  are  roughly  consistent  in  the  same  document  image, 
we  can  merge  states  Sc,i,i  =  1,  2, . . . ,  N  —  1,  into  one  state  Sq,  and  Sl^  i  —  1,  2, . . . ,  N, 
into  another  state  Sl-  Fig.  7  shows  the  simplified  model.  State  merging  reduces  the 
number  of  parameters  significantly.  Another  advantage  of  such  simplification  is  that  we 
do  not  need  to  know  explicitly  the  number  of  lines  on  a  document. 
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Table  1:  Observation  probability  distribution  matrix  B  estimated  from  the  training  set 
containing  100  documents 


0 

1 

2 

3 

4 

Non-peak 

(“•SI 

(  W  W  ] 

'  16  ’  8  -1 

/  W  W  I 

V  8  ’  4  J 

GW] 

Sl 

106  (4.7%) 

246  (10.8%) 

378  (16.6%) 

1,051  (46.2%) 

493  (21.7%) 

St , Sb,Sq 

191,086  (98.8%) 

2,052  (1.1%) 

170  (0.1%) 

58  (0.03%) 

15  (0.008%) 

2.5.3  Parameter  Estimation 

In  our  data  set,  the  quality  of  different  images  varies  significantly  as  does  the  quality 
of  rule  lines  on  the  same  image.  Therefore,  we  cannot  use  the  Gaussian  distribution  to 
model  the  projections  of  rule  lines  (the  Gaussian  mixture  distributions  may  be  a  good 
approximation).  Instead,  we  quantize  the  observation  into  several  levels  and  estimate  the 
probability  of  each  quantized  level  directly  from  the  ground-truthed  data  set.  Peaks  on 
the  projection  profile  have  particular  significance  for  line  detection.  Therefore,  we  first 
set  all  non-peaks  on  the  profile  to  zero,  then  quantize  the  peaks  on  the  projection  profile 
into  four  levels  using  the  following  quantization  thresholds:  w/16,  w/8,  and  wj 4,  where 
w  is  the  image  width.  The  observation  probability  distribution  matrix  B ,  estimated  from 
the  training  set  containing  100  documents,  is  listed  in  Table  1.  We  let  states  St,  Sb,  and 
Sg,  whose  observations  are  the  projection  of  text  or  noise,  share  the  same  observation 
distribution.  We  observed  that  (1)  due  to  the  severe  brokenness,  the  horizontal  projec¬ 
tions  of  about  80%  of  rule  lines  are  less  than  1/4  of  the  image  width;  (2)  4.7%  of  rule  lines 
do  not  form  peaks;  and  (3)  the  peaks  with  small  heights  are  more  likely  formed  by  text 
strokes  or  noise  (2,052  instances)  rather  than  by  rule  lines  (246  instances).  Therefore, 
we  need  to  use  high  level  contextual  information  to  achieve  reasonable  detection  results 
for  these  severely  broken  lines. 

We  set  duration  probability  of  states  St  and  Sb  to  the  uniform  distribution  on  [0,  g  — 
1] .  The  duration  probability  of  state  So  is  estimated  directly  from  the  ground-truth  with 
the  approach  described  in  Section  2.4.2.  With  all  these  settings,  the  rule  line  detection 
accuracy  on  the  training  set  is  about  95.6%.  For  comparison,  the  accuracy  is  only  91.7%  if 
we  use  the  Gaussian  distribution  for  approximation.  Since  the  parameters  estimated  from 
the  training  data  are  not  optimal  for  the  ultimate  detection  error  criterion,  the  simplex 
method  proposed  by  Nelder  and  Mead  [56]  is  used  to  search  the  optimal  parameter  set 
which  minimizes  the  detection  error.  Among  the  parameters  of  our  model,  we  optimize 
only  the  observation  probability  matrix  B.  Experiments  show  the  detection  accuracy 
increases  to  97.3%  on  the  training  set  after  optimization. 

2.5.4  Examples 

HMM  decoding  may  detect  extra  lines  on  the  top  or  bottom  image  borders.  To  reduce  the 
false  alarm  rate,  we  remove  lines  with  less  than  50  black  pixels.  Fig.  2c  shows  the  model- 
based  line  detection  result  for  a  rule-lined  document.  Compared  with  Fig.  2b,  we  can  see 
that,  with  contextual  information,  the  result  is  significantly  improved.  Our  model-based 
method  is  robust  even  when  the  input  images  do  not  follow  the  model  exactly.  Fig.  8a 
shows  an  example:  two  pages  are  overlapped  during  scanning.  Our  algorithm  still  detects 
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Figure  8:  Robust  test  for  rule  line  detection.  The  image  in  (a)  has  enormous  observations 
on  the  projection  profile  (several  lines  missed  because  two  pages  are  overlapped  during 
scanning),  (c)  shows  a  document  with  enormous  line  gaps  (35  image  rows  removed 
manually  in  the  middle),  (b)  and  (d)  show  the  corresponding  line  detection  results. 


all  rule  lines  correctly.  In  Fig.  8c,  we  remove  35  rows  of  the  image  (about  half  of  the 
average  vertical  line  gap  of  this  document).  The  variation  of  the  line  gap  is  out  of  the 
range  allowed  by  the  model.  The  corresponding  detection  result  is  shown  in  Fig.  8d,  with 
only  one  line  missed  due  to  the  anomalous  vertical  line  gap. 

2.6  Experiments 

In  this  section,  we  present  our  evaluation  metrics,  quantitatively  evaluate  the  robustness 
of  our  line  detection  algorithm,  and  compare  it  with  several  non-model-based  algorithms. 

2.6.1  Line  Detection  Evaluation  Protocol 

Line  detection  accuracy  can  be  evaluated  at  the  pixel  and  line  levels  [57].  At  the  pixel 
level  we  compare  the  difference  of  the  pixels  between  ground-truth  and  detected  lines. 
It  is  straightforward  and  objective,  but  ground-truthing  at  the  pixel  level  is  extremely 
expensive  when  lines  are  broken,  distorted,  and  overlapped  with  text.  Therefore,  we 
evaluate  the  algorithm  at  the  line  level.  Our  evaluation  metric  is  based  on  the  Hausdorff 
distance.  The  Hausdorff  distance  between  two  point  sets  is 

H(A,  B )  =  rna x{h(A,  B ),  h(B,  A)}  (28) 

where 

h(A,  B)  =  max  min  |  |a  —  b\ |  (29) 

a^A  b£B 

and  j|.||  is  an  underlying  norm  (e.g.,  the  L2  or  Euclidean  distance).  The  function  h(A,B ) 
is  called  the  directed  Hausdorff  distance  from  A  to  B.  It  identifies  the  point  a  €  A  that  is 
the  farthest  from  any  point  of  B  and  measures  the  distance  from  a  to  its  nearest  neighbor 
in  B  [58] .  The  direct  computation  method  for  the  Hausdorff  distance  is  time  consuming, 
but,  for  polyline  representation,  the  Hausdorff  distance  can  be  easily  calculated.  Sup¬ 
pose  polylines  A  and  B  are  represented  as  a  sequence  of  vertices  (Ai,  A2, . . .  ,Am)  and 


23 


Figure  9:  Hausdorff  distance  between  two  polylines. 


(Bi,  B2, . . . ,  Bn)  respectively,  then  the  Hausdorff  distance  is  simplified  as 

H(A,  B)  =  ma x{H'(A,  B),  e(A,  B)}  (30) 

where 

H'(A,  B)  =  max{DA1,DA2,...,DAm,  D  Bh  ■Db‘1')  •  •  •  5  D Bn  }  (31) 

e(A,  B)  =  maxdl^x  -  Si||,  ||^lm  -  H„||}  (32) 


DAi  is  the  perpendicular  distance  from  At  to  polyline  B ,  and  DBl  is  the  perpendicu¬ 
lar  distance  from  Bi  to  polyline  A,  as  shown  in  Fig.  9.  H'(A,B)  in  Eq.  (31)  is  the 
perpendicular  distance  between  two  polylines  A  and  B ,  which  evaluates  the  accuracy  in 
determining  the  vertical  location  of  a  horizontal  line  and  the  horizontal  location  of  a 
vertical  line.  |  \Aj,  —  Bj\  |  is  the  Euclidean  distance  between  points  Ai  and  By  Suppose  the 
vertices  of  a  polyline  are  sorted  from  left  to  right  for  a  horizontal  line,  and  top  to  bottom 
for  a  vertical  line.  Then  ||Ai  —  Bi\\  and  || Am  —  Bn ||  are  the  end  point  determination 
errors.  Hausdorff  distance  H(A,  B )  in  Eq.  (30)  combines  the  perpendicular  distance  and 
end  point  determination  errors  into  one  metric. 

For  severely  broken  lines,  however,  it  is  hard  to  define  the  end  points  exactly.  There¬ 
fore,  we  prefer  to  use  two  separate  metrics:  the  perpendicular  distance  and  end  point 
determination  error  for  evaluation,  instead  of  a  combined  Hausdorff  distance. 

The  end  point  determination  error  is  an  absolute  value.  As  a  supplemental  metric, 
the  overlap  rate  of  polylines  A  and  B 


°(A,  B) 


min{Am,  Bn}  -  max{Ai,  Hi} 
max{Am,  Bn}  -  min{Ai,  Hi} 


(33) 


is  defined  to  evaluate  the  relative  end  point  determination  error. 

As  suggested  in  [59],  if  a  detected  line  is  within  no  more  than  Eve  pixels  to  a  ground- 
truthed  line  in  the  perpendicular  direction,  it  is  said  to  be  correctly  detected.  If  the 
perpendicular  distance  is  larger  than  five  pixels  and  no  more  than  ten  pixels,  it  is  said 
to  be  partially  correct.  Splitting  and  merging  errors  are  all  assigned  as  partially  correct 
too. 


2.6.2  Quantitative  Evaluation  for  Rule  Line  Detection 

We  obtained  168  Arabic  document  images  with  a  total  of  3,870  ground-truthed  rule 
lines,  most  of  which  are  severely  broken.  We  use  100  images  to  train  the  HMM,  and  the 
remaining  68  images  as  the  test  set.  The  detection  results  on  the  test  set  are  shown  in 
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Table  2:  Comparison  of  our  model-based  method  with  other  methods  on  the  test  set  for 
rule  line  detection  (there  are  a  total  of  1,596  ground-truthed  lines). 


Detected 

Correct 

Partial  Correct 

Missed 

False  Alarm 

Hough  Transform 

1,588 

1,299  (81.4%) 

60  (3.8%) 

237  (14.9%) 

229  (14.4%) 

Projection  Method 

1,577 

1,310  (82.1%) 

112  (7.0%) 

174  (10.9%) 

155  (9.7%) 

DSCC 

2,162 

1,398  (87.6%) 

118  (7.4%) 

80  (5.0%) 

646  (40.5%) 

Our  Model- 
Rased  Method 

1,631 

1,545  (96.8%) 

49  (3.0%) 

2  (0.1%) 

37  (2.3%) 

the  last  row  of  Table  2.  On  the  test  set,  96.8%  of  lines  are  detected  correctly,  with  only 
two  lines  missed.  The  false  alarm  rate  is  2.3%.  Most  of  the  false  alarms  are  caused  by  the 
inconsistency  between  the  detector  and  the  subjective  judgment  of  the  ground-truther 
when  lines  are  severely  broken.  For  correctly  detected  lines,  we  evaluate  the  end  point 
determination  accuracy  using  the  end  point  determination  error  and  overlap  rate  defined 
in  Eq.  (32)  and  (33)  respectively.  The  average  end  point  determination  error  is  six  pixels 
and  the  overlap  rate  is  99.1%. 

We  compared  our  model-based  line  detection  algorithm  with  other  non-model-based 
line  detection  algorithms:  the  Hough  transform  method  [6],  the  projection  method  [7], 
and  the  DSCC  method  [11].  Table  2  shows  the  line  detection  results  on  the  test  set 
with  different  algorithms.  The  results  of  the  Hough  transform  and  projection  methods 
listed  in  the  table  are  tested  on  the  images  after  text  filtering.  The  projections  of  lines 
often  fail  to  form  peaks  on  the  projection  profile,  if  lines  overlap  with  text  or  they  are 
severely  broken.  Text  filtering  helps  lines  to  form  peaks  on  the  projection  profile,  therefore 
increases  the  detection  rate.  On  this  data  set,  under  roughly  the  same  false  alarm  rate, 
the  detection  accuracy  increases  from  73%  on  raw  images  to  82%  on  text-filtered  images. 
For  either  projection  or  Hough  transform  methods,  only  those  peaks  with  values  larger 
than  a  threshold  are  picked  as  line  positions.  With  a  small  threshold,  we  can  detect 
more  lines,  but  the  false  alarm  rate  is  high.  Increasing  the  threshold  will  reduce  the  false 
alarm  rate,  but  increase  the  mis-detection  rate.  We  selected  the  threshold  to  make  the 
false  alarm  rate  roughly  equal  the  mis-detection  rate.  To  reduce  the  false  alarm  rate 
of  the  Hough  transform  method  further,  we  restrict  the  search  range  of  9  to  [—1°,  1°] 
after  skew  correction.  For  the  DSCC  method,  we  restrict  the  merging  direction  to  the 
horizontal  direction.  As  expected,  our  model-based  method  achieved  much  better  results 
in  both  accuracy  and  false  alarm  rate,  due  to  the  use  of  high  level  constraints  between 
neighboring  lines. 

2.7  Summary  and  Future  Work 

We  present  a  novel  approach  to  detect  severely  broken  rule  lines  in  documents.  Our 
method  is  based  on  a  stochastic  model  to  incorporate  high  level  constraints  into  a  gen¬ 
eral  line  detection  algorithm.  Instead  of  detecting  lines  individually,  we  use  the  Viterbi 
algorithm  to  detect  all  parallel  lines  simultaneously.  Our  method  can  detect  96.8%  of 
the  severely  broken  rule  lines  in  the  Arabic  database  we  collected.  Some  challenging 
examples  demonstrated  the  robustness  of  our  approach. 

After  detection,  rule  lines  must  be  removed  before  further  document  processing. 
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Fig.  2d  shows  the  document  image  after  rule  line  removal.  The  result  is  reasonable; 
when  text  strokes  overlap  with  lines,  some  parts  of  text  strokes  may  be  removed  erro¬ 
neously.  A  more  robust  method  should  be  developed  to  improve  the  line  removal  results. 
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3  Known  Form  Processing 
3.1  Introduction 

Millions  of  form  documents,  such  as  tax  return  forms,  health  insurance  forms,  airline 
vouchers,  checks,  and  bank  slips,  are  processed  everyday  [13,14,17,60-62],  Processing 
of  such  documents  can  be  categorized  as  unknown  and  known  form  processing  [14].  Un¬ 
known  form  processing  assumes  no  a  priori  knowledge  from  the  input  forms,  and  extracts 
all  information  based  on  low  level  image  analysis.  Errors  are  expected  and  user  assis¬ 
tance  is  required.  Known  form  processing,  on  the  other  hand,  is  designed  to  process  a 
pre-defined  set  of  forms,  where  a  priori  information  can  be  stored  as  templates  in  the 
database  to  guide  later  processing.  It  is  widely  used  in  banks,  post  offices,  and  tax  offices 
where  the  types  of  forms  are  most  often  pre-defined.  For  an  input  form,  the  system 
first  selects  the  template  that  it  matches  best  ( form  identification ),  then  extracts  an¬ 
chors  (such  as  specific  marks  and  form  frame  lines)  for  registration  to  compensate  for 
variations  produced  by  scanning  (e.g.,  rotation,  translation,  scaling,  and  local  nonlinear 
distortions)  2.  Finally,  the  identified  template  is  used  to  guide  the  system  to  recognize 
fields  of  interest  on  the  form  (different  OCR  engines  may  be  used  for  different  fields), 
and  output  the  recognition  results  to  a  database.  Although  special  anchors  may  be  avail¬ 
able  to  facilitate  form  identification  and  registration  for  specially  designed  forms,  more 
general  approaches  use  features  related  explicitly  or  implicitly  to  frame  lines,  such  as  the 
frame  lines  themselves  [13,14,60,61],  form  cells  [62],  and  intersections  of  frame  lines  [17]. 
Robust  detection  of  frame  lines  is  crucial  to  these  approaches.  In  the  previous  chapter, 
we  proposed  a  model-based  parallel  line  detection  algorithm  using  hidden  Markov  models 
(HMM).  In  this  chapter,  we  apply  it  for  known  form  processing.  As  shown  in  Fig.  10a, 
generally  there  are  two  groups  of  parallel  lines  (one  horizontal  and  the  other  vertical) 
on  a  form,  so  we  use  two  HMMs  to  detect  the  horizontal  and  vertical  lines  separately. 
The  detected  lines  can  be  used  for  registration.  Our  algorithm  can  be  extended  for  form 
identification  too,  so  in  our  approach,  a  unified  framework  solves  both  important  tasks 
in  known  form  processing. 

For  known  form  processing,  we  need  to  not  only  detect  lines  reliably,  but  also  find  the 
correspondence  between  the  detected  lines  and  those  stored  in  the  form  template  [13,14], 
The  method  proposed  by  Tang  et  al.  [13]  assumes  there  is  only  one  anchor  line  in  a 
pre-defined  region,  which  can  be  distinguished  easily  from  other  lines.  The  application 
of  this  method  is  restricted.  Considering  false  alarms  and  mis-detections,  the  correspon¬ 
dence  problem  is  not  trivial.  Cesarini  et  al.  [14]  proposed  a  hypothesis  and  verification 
paradigm  as  a  solution.  For  a  detected  line  in  a  pre-define  region,  several  hypotheses 
are  generated  about  correspondence  between  the  line  and  those  in  the  template.  Under 
each  hypothesis,  the  rough  positions  of  other  lines  can  be  determined,  then  the  system 
searches  the  expected  lines  to  verify  the  hypothesis.  The  output  of  the  verification  mod¬ 
ule  is  binary:  success  or  failure.  All  lines  used  for  registration  should  be  detected  to 
achieve  a  consistent  solution,  so  it  is  not  robust  to  line  degradations.  Both  methods  need 

2If  preprinted  content  (fixed  part)  dominates  user- filled  information  (variant  part),  general  image  reg¬ 
istration  methods  (e.g.,  correlation-based  methods)  can  be  used  for  form  registration  without  detecting 
any  lines  or  landmarks. 
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(a)  (b) 

Figure  10:  (a)  An  example  deposit  form  of  the  Industrial  and  Commercial  Bank  of  China. 
There  are  two  groups  for  parallel  lines  on  this  form,  one  horizontal  and  the  other  vertical, 
(b)  Line  detection  result  using  our  model-based  approach. 


an  initial  region  to  detect  the  first  anchor  line,  and  only  a  subset  of  lines  are  used  for 
registration.  In  our  approach,  we  use  all  lines  for  registration,  but  we  do  not  perform 
binary  assertion  during  HMM  decoding.  Instead,  we  measure  the  probability  of  a  projec¬ 
tion  to  be  generated  by  a  line.  The  optimal  detection  results  are  achieved  by  the  Viterbi 
algorithm.  The  degradation  of  a  few  lines  may  not  deteriorate  the  performance.  Another 
advantage  of  our  approach  is  that  the  detection  and  correspondence  problems  are  solved 
simultaneously.  After  HMM  decoding,  the  correspondence  between  the  detected  lines 
and  those  in  the  form  template  (or  the  model)  is  achieved  automatically. 

The  remainder  of  this  chapter  is  organized  as  following.  In  Section  3.2,  we  apply 
our  general  model-based  line  detection  algorithm  for  form  frame  line  detection.  The 
quantitative  evaluation  of  the  robustness  of  our  approach  is  presented  in  Section  3.3. 
Our  approach  can  be  extended  for  form  identification,  which  is  described  in  Section  3.4. 
This  chapter  ends  with  a  brief  summary  in  Section  3.5. 

3.2  Form  Frame  Line  Detection 

The  application  of  the  algorithm  to  known  form  processing  is  straightforward.  Generally, 
a  collection  of  horizontal  and  vertical  parallel  lines  exists  on  a  form,  so  we  use  two  HMMs 
to  detect  the  horizontal  and  vertical  lines  separately.  To  apply  the  algorithm,  we  need 
to  estimate  two  sets  of  parameters:  (1)  The  distribution  of  observation  symbols  of  each 
state;  and  (2)  The  state  duration  probabilities  of  each  gap  state. 

3.2.1  Estimation  of  the  Distributions  of  Observation  Symbols 

In  our  case,  the  observation  symbols  are  the  projection  profile,  which  has  the  range  of 
[0,  w]  for  a  horizontal  projection  (where  w  is  the  width  of  the  image).  As  we  stated 
previously,  a  large  number  of  observation  symbols  would  cause  difficulties  in  reliably 
estimating  the  distributions  with  limited  training  samples.  With  some  assumptions,  we 
can  show  that  using  a  Gaussian  distribution  to  model  projections  of  a  line  over  multiple 
form  instances  is  appropriate.  In  a  widely  used  stochastic  document  image  degradation 
model  [63],  a  white  (black)  pixel  is  randomly  selected  and  flipped  to  black  (white).  The 


Figure  11:  The  distributions  of  the  observation  symbols  (horizontal  projections)  for  100 
scanned  instances  of  a  bank  deposit  form.  One  example  of  the  form  is  shown  in  Fig.  10a. 
There  are  six  horizontal  lines  on  the  form,  (a)  to  (f)  The  histograms  of  the  projections 
of  six  horizontal  lines  respectively,  (g)  The  histogram  of  the  projections  of  the  non-line 
states. 


projection  is  the  summation  of  all  black  pixels  on  the  line 


h  =  a.j. 


where 

_  J  1  if  black  pixel  i  is  preserved  ,  , 

1  0  if  black  pixel  i  is  flipped  to  white  during  degradation 

U rider  white  Gaussian  noise  (a  widely  used  model  for  degradation),  a,-  follows  a  Bernoulli 
distribution:  a,  ~  Bernoulli(p),  where  p  is  the  probability  for  a  black  pixel  to  be  lost. 
Consequently,  h  follows  a  binomial  distribution  Bin(p ,  M) 

p(h)=  (((V'Al  -  p)".  (36) 

According  to  the  central  limit  law,  if  M  is  large  enough  (or  if  the  line  is  long  enough), 
then  the  distribution  of  random  variable  h  converges  to  a  Gaussian  distribution  [51] 


,  h  -  E[h] 
lim  —  = 

M^°°  \/Mp(  1  -  p) 


J\f( 0, 1)  in  distribution. 


In  known  form  processing,  a  set  of  forms  are  captured  with  similar  imaging  conditions. 
Therefore,  p  is  roughly  constant  for  each  form  in  the  set.  A  Gaussian  distribution  is  a  good 
approximation  for  the  projections  of  a  line  over  multiple  form  instances.  The  mean  and 
variance  of  the  Gaussian  distribution  can  be  estimated  from  the  ground-truth.  Figs.  11a 
to  Ilf  show  the  distributions  of  the  projections  of  all  six  horizontal  lines  on  a  set  of  bank 
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Table  3:  The  distribution  of  the  line  gap  between  the  first  and  second  horizontal  lines  on 
a  bank  deposit  form.  The  average  is  94  pixels.  The  row  of  distance  lists  the  difference 
to  the  average  value. 


Distance 

-9 

-8 

-7 

-6 

-5 

-4 

-3 

-2 

-1 

0 

1 

2 

3 

4 

5 

6 

7 

8 

9 

Raw  Occurrence 

1 

0 

2 

0 

0 

3 

5 

12 

18 

24 

16 

6 

8 

4 

0 

1 

0 

0 

0 

Symmetric 

Regularization 

.5 

0 

1 

.5 

0 

3.5 

6.5 

9 

17 

24 

17 

9 

6.5 

3.5 

0 

.5 

1 

0 

.5 

Zero-Occurrence 

Smoothing 

.5 

.5 

1 

.5 

.5 

3.5 

6.5 

9 

17 

24 

17 

9 

6.5 

3.5 

.5 

.5 

1 

.5 

.5 

deposit  forms  with  one  instance  shown  in  Fig.  10a.  The  histogram  is  generated  over  100 
form  samples.  We  can  see  that  the  Gaussian  distribution  is  a  good  approximation.  For 
non-line  states,  the  approximation  is  not  good  enough,  since  a  projection  is  always  no 
less  than  zero  (an  exponential  distribution  may  be  more  suitable),  as  shown  in  Fig.  llg. 
We  found  in  the  experiments  that  the  effect  of  this  approximation  error  is  negligible  for 
the  final  line  detection  result. 

3.2.2  Estimation  of  the  State  Durations 

The  state  duration  of  Sa,i,  i  =  1,  2, . . . ,  N  —  1,  represents  the  line  gap  between  lines  i  and 
%  +  1,  which  can  be  estimated  from  the  ground-truth.  Table  3  shows  the  distribution  of 
the  gap  between  the  first  and  second  horizontal  lines  on  the  bank  deposit  form  in  our 
database  of  100  samples.  The  average  value  of  the  gap  is  94  pixels.  The  row  of  distance 
lists  the  difference  to  the  average  value.  The  row  of  raw  occurrence  shows  the  number  of 
occurrences  in  which  the  gap  takes  a  specific  value.  We  can  see  that  the  variation  is  from 
-9  pixels  to  6  pixels,  and  the  distribution  is  roughly  symmetric  around  the  average  value. 
Due  to  the  sparse-data  problem,  some  entries  within  the  range  of  [-9,  6]  are  not  observed 
in  the  training  set,  which  will  deteriorate  the  performance.  Therefore,  data  smoothing  is 
used.  The  row  of  symmetric  regularization  is  the  result  after  we  impose  the  symmetry. 
Lastly,  we  set  the  zero  entries  to  the  minimal  value  of  all  non-zero  entries,  as  shown  in 
the  row  of  zero- occurrence  smoothing.  After  data  smoothing,  the  distribution  P\  (d)  can 
be  estimated  by  normalization.  Similarly,  we  can  get  the  distributions  of  other  line  gaps. 

3.2.3  Decoding 

After  estimating  parameters,  we  use  the  Viterbi  algorithm  to  decode  the  observation. 
Fig.  12  shows  the  decoding  results  of  the  Viterbi  algorithm  on  the  horizontal  and  vertical 
projection  profiles  of  the  bank  deposit  form  (Fig.  10a).  The  locations  found  by  the  Viterbi 
algorithm  are  labeled  with  squares.  We  can  see  that  instead  of  picking  the  highest  peaks 
as  detected  lines  in  the  projection  methods  [7,10],  our  approach  outputs  the  line  positions 
most  compatible  with  the  model. 

After  detecting  the  horizontal  and  vertical  lines,  the  method  described  in  the  previous 
chapter  can  be  used  to  determine  the  end  points  of  the  lines.  However,  if  a  line  is 
severely  degraded,  the  end  points  cannot  be  determined  accurately.  For  many  forms,  the 
intersections  of  horizontal  and  vertical  lines  can  be  used  to  determine  the  end  points. 
Sometimes,  several  lines  may  lie  on  the  same  line,  for  example,  three  dashed  lines  in  the 
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Figure  12:  The  lines  detected  after  decoding  the  HMMs  using  the  Viterbi  algorithm  on 
the  horizontal  (a)  and  vertical  (b)  projection  profiles  of  the  bank  deposit  form.  The 
original  form  is  shown  in  Fig.  10a.  The  locations  picked  np  by  the  Viterbi  algorithm  are 
labeled  with  squares. 


middle  of  the  form  as  shown  in  Fig.  10a.  Our  HMM-based  method  can  handle  this  special 
case  without  difficulty.  In  this  example,  the  vertical  line  gaps  between  dashed  lines  are 
zero.  They  share  the  same  horizontal  projection.  The  Viterbi  algorithm  gives  the  vertical 
position  of  these  lines.  The  left  and  right  end  points  are  determined  using  their  position 
relative  to  the  intersections  of  horizontal  and  vertical  lines.  In  this  case,  horizontal  and 
vertical  lines  should  be  extended  to  get  the  intersection  points.  Fig.  10b  shows  the  model- 
based  line  detection  result.  We  can  see  our  method  can  detect  the  short  lines  which  may 
not  form  peaks  on  the  projection  profile  (especially  for  the  two  shortest  vertical  lines), 
which  are  most  likely  missed  by  other  methods  such  as  the  Hough  transform  or  projection 
methods.  Our  method  outputs  the  exact  number  of  lines  indicated  by  the  model  without 
false  alarms.  Fig.  13  shows  two  more  examples  of  an  export  registration  form  used  by 
the  Customs  Bureau  of  China  and  a  portion  of  a  US  income  tax  form. 

3.3  Experiments  for  Form  Frame  Line  Detection 

To  evaluate  the  algorithm  for  known  form  processing,  we  collected  100  bank  deposit 
forms.  In  this  experiment,  we  did  not  evaluate  the  accuracy  of  form  registration  directly. 
The  accuracy  of  form  registration  depends  on  which  deformation  model  (global  affine 
transformation  or  more  flexible  local  deformation)  is  used  to  transform  the  input  form 
to  the  prototype  form.  Since  the  detected  lines  are  used  for  both  form  identification 
(discussed  in  the  next  section)  and  registration,  we  evaluate  the  line  detection  accuracy. 

The  experiment  demonstrated  that  one  training  sample  can  achieve  reasonable  results 
if  image  quality  is  good.  We  selected  the  first  image  for  training.  The  real  value  of  the 
projection  of  a  line  in  this  training  sample  is  taken  as  the  mean  of  the  observation  random 
variable  of  the  corresponding  line  state.  The  variance  of  the  observation  random  variable 
of  a  line  state  is  set  as  20%  of  its  mean.  The  distribution  of  line  gaps  is  set  within 
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Figure  13:  Some  examples  for  model-based  form  frame  line  detection,  (a)  and  (b)  An 
export  registration  form  used  by  the  Customs  Bureau  of  China  and  the  corresponding  line 
detection  result,  (c)  and  (d)  A  portion  of  a  US  income  tax  form  and  the  corresponding 
line  detection  result.  The  detected  lines  are  shown  in  black  and  overlay  with  the  original 
documents. 


Table  4:  Comparison  of  our  model-based  method  with  other  methods  for  known  form 
processing  (there  are  a  total  of  1,980  ground-truthed  lines). 


Detected 

Correct 

Partial  Correct 

Missed 

False  Alarm 

Hough  Transform 

1,980 

1,675  (84.6%) 

8  (0.4%) 

297  (15.0%) 

297  (15.0%) 

Projection  Method 

1,980 

1,745  (88.1%) 

15  (0.8%) 

223  (11.3%) 

220  (11.1%) 

DSCC 

2,032 

1,803  (91.1%) 

125  (6.3%) 

175  (8.8%) 

104  (5.3%) 

(Jur  Model- 
_ Rased  Method _ 

1,980 

1,976  (99.8%) 

4  (0.2%) 

0  (0.0%) 

0  (0.0%) 

the  range  of  [-10,  10]  pixels  around  its  real  value  in  this  sample.  We  tested  it  on  the 
remaining  99  form  images.  The  last  row  in  Table  4  shows  the  result,  using  the  evaluation 
metrics  defined  in  the  previous  chapter.  All  lines  are  detected  without  any  false  alarms. 
Only  four  lines  are  detected  with  large  location  errors.  For  comparison,  Table  4  shows 
the  detection  results  of  other  algorithms.  Both  Hough  transform  and  projection  methods 
need  a  threshold,  the  minimum  pixels  on  a  line,  to  reduce  the  false  alarm  rate.  To 
avoid  using  an  arbitrarily  threshold,  we  selected  the  first  six  longest  horizontal  lines 
and  14  longest  vertical  lines  as  the  detection  results  for  both  the  Hough  transform  and 
projection  methods.  Our  algorithm  clearly  outperforms  all  three  general  line  detection 
methods  in  both  mis-detection  and  false  alarm  rates. 

In  the  following  experiments,  we  tested  the  robustness  of  our  method  under  different 
scanning  resolutions,  scanning  binarization  thresholds,  and  synthesized  image  degrada¬ 
tions.  Generally,  the  more  severe  the  degradation,  the  more  accurate  the  model  should 
be  in  order  to  detect  lines  correctly.  Therefore,  in  the  following  experiments,  we  increased 
the  number  of  training  samples.  We  randomly  selected  50  forms  for  training,  and  used 
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Resolution  (dpi)  Binarization  threshold  Ratio  of  black  pixels  flipped 

(a)  (b)  (c) 

Figure  14:  Robustness  testing.  The  curves  labeled  with  o  shows  the  detection  accuracy 
under  the  condition  where  the  test  set  has  the  same  degradation  level  with  the  training 
set.  Curves  labeled  with  +  are  the  results  when  the  test  set  and  the  training  set  have 
different  degradation  levels,  (a)  Scanning  resolution,  (b)  Binarization  threshold,  (c) 
Synthesized  degradation. 


the  remaining  50  forms  for  testing.  Fig.  14a  shows  the  line  detection  accuracy  under 
different  scanning  resolutions.  As  we  can  see,  the  performance  of  the  algorithm  stays 
consistently  high  under  a  wide  range  of  scanning  resolutions  from  75  dpi  to  600  dpi.  The 
line  width  varies  from  about  one  pixel  under  75  dpi  resolution  to  10  pixels  under  600 
dpi.  Though  onr  model  does  not  include  the  duration  of  line  states,  this  inaccuracy  in 
modeling  has  a  negligible  effect  on  its  performance. 

In  the  next  experiment,  we  fixed  the  scanning  resolution  to  300  dpi  and  used  different 
binarization  thresholds  during  scanning.  If  the  threshold  is  too  small,  the  lines  are 
severely  broken  as  shown  in  Fig.  15a  (with  the  threshold  of  40).  If  the  threshold  is  too 
large,  text  and  lines  are  smeared  together,  as  shown  in  Fig.  15c  (with  the  threshold  of 
240).  As  shown  in  Fig.  15b  and  d,  our  algorithm  can  still  detect  lines  correctly  under 
such  extreme  conditions.  The  quantitative  evaluation  result  is  shown  in  Fig.  14b.  The 
curve  labeled  with  o  in  the  figure  shows  the  detection  accuracy  when  the  training  set 
and  test  set  are  scanned  with  the  same  binarization  threshold.  In  most  applications,  the 
test  set  may  have  different  characteristics  with  the  training  set.  The  curve  labeled  with 
+  in  Fig.  14b  shows  the  detection  accuracy  using  the  HMM  trained  on  the  training  set 
scanned  with  a  binarization  threshold  of  128.  As  we  can  see,  good  results  are  achieved  in 
a  wide  range  even  though  the  test  set  has  different  characteristics  with  the  training  set. 

Synthesized  data  are  often  used  to  test  an  algorithm  because  it  can  directly  control 
the  image  quality  of  the  test  samples.  In  the  following  experiments,  we  selected  the  data 
set  with  good  image  quality  (scanned  with  300  dpi  and  the  binarization  threshold  of 
128),  and  randomly  flipped  a  certain  ratio  of  black  pixels  on  lines  to  white,  keeping  all 
pixels  on  text  unchanged.  Figs.  16a  and  16c  show  the  degraded  images  with  50%  and 
90%  black  pixels  on  lines  flipped  to  white,  respectively.  As  shown  in  Fig.  16b,  the  line 
detection  result  is  perfect  even  if  half  the  black  pixels  are  flipped.  Fig.  16d  shows  that 
the  horizontal  lines  are  still  detected  correctly  even  when  90%  black  pixels  are  flipped, 
but  the  vertical  lines  are  misdetected.  Fig.  14c  shows  the  detection  accuracy  versus 
degradation  level  on  the  test  set.  The  curve  labeled  with  o  shows  the  results  when  the 
test  sets  have  the  same  degradation  level  with  the  training  sets.  We  can  see  that  our 


33 


Figure  15:  Scanned  form  documents  with  different  binarization  thresholds  and  the  cor¬ 
responding  line  detection  results,  (a)  and  (c)  Scanned  images  under  thresholds  of  40  and 
240  respectively,  (b)  and  (d)  are  corresponding  line  detection  results  of  (a)  and  (c).  The 
detected  lines  are  shown  in  black  and  overlay  with  the  original  documents. 

method  is  very  robust.  It  maintains  good  results  with  accuracy  of  96.2%  even  when  80% 
black  pixels  of  lines  are  flipped.  The  curve  labeled  with  +  shows  the  accuracy  on  the 
test  sets  using  the  HMM  model  trained  on  samples  with  the  degradation  level  of  50%. 
Almost  the  same  accuracy  is  achieved  until  70%  black  pixels  on  lines  are  flipped.  After 
that,  it  breaks  down  faster. 

3.4  Form  Identification 

Our  line  detection  algorithm  can  be  extended  to  form  identification.  Suppose  there  are 
n  form  templates  Ai,  A2, . . . ,  An.  According  to  the  Bayesian  rule,  A  that  maximizes  the 
posteriori  probability  is  selected  as  the  template  for  the  input  form 

A  =  arg  max  P(Aj|0)  =  arg  max P(0|A*)P(Aj).  (38) 

A  i  A  i 

Here,  O  is  the  observation  (the  projection  profile  in  our  method).  P( A*)  is  the  priori 
probability  of  form  template  A*.  P(0|A*)  is  the  probability  of  observing  the  sequence 
of  observations  given  the  model  A *,  which  can  be  calculated  efficiently  with  the  forward 
algorithm  [50]. 

Although  a  Gaussian  distribution  is  a  good  approximation  for  observations  of  a  line 
state,  it  is  not  good  enough  to  approximate  the  observations  of  a  non-line  state,  as  shown 
in  Fig.  llg.  As  demonstrated  experimentally,  such  approximation  error  does  not  affect 
the  line  detection  results  noticeably.  However,  it  will  make  calculating  the  probability 
P(0|A)  in  Eq.  (38)  un-reliable,  since  the  observations  are  dominated  by  non-line  states. 
Alternatively,  given  a  form  model  A,  we  first  detect  lines  under  the  model.  Suppose, 
hL  1,  hc,2,  •  •  • ,  hLN  are  decoded  as  observations  of  line  states,  and  gi,  g2,  ■  ■  ■ ,  are  line 


34 


if NC  P1702H1 

ICBC  BJ  BR  .FOREIGN  CURRENCY  TIMEDEPOSIT  SAVlNGjjjRECEIPT 

A  A  a  •  *  -ft  3  •'!  *  I 

000108  50000108  »  2.0  0  ;*  S.  1843  2002/01/08  '  '16  i 


5!  1 00-00020599-7  /» 

S  {ACCOUNT  NO.  . DEPOSITOR 


US^3,  109.  90 

LAMOUNT,^— 

25.75  3.109.90  100000705997 

5.15  SI**!®:  3,130.50 


i  «.  •-?■ 

_ ORIGINAL 

ACCOUNT  NO  V 


A« 


4ite  ■- 


ICBC  BJ  BR  .FOREIGN  CURP 


N?  H17I 


US$3,  109.  90 

XT,  ^-FFSE-fiT 
jB!BB6^lSSffi30  25.75 


T  NO  K‘I 


W  7L  m.  m  45- 
3,109.90  100000705997 
Ii4»:  3,130.50 


^rb.tf%3isr/ , 


(b) 


M.  *****tf*#**Nf****W#+lN  P1702G1 

'.  *t»:  (■»)  ICBC  BJ  BR  .FOREIGN  CURRENCY  TIME  DEPOSIT  SAVINGJRECE1PT 

1  *ai  ;  *«8  ;|  ^  ■**  *  *  *  **#*  ;•«#; 

,20000108  20000108  -j»  IS  .2.0  0  j*  5.1843  K  2002/01/08  116 

|.  *  100-00070599-7  J»  *  *  *•  '» 

*  ACCOUNT  NO.  DEPOSITOR  THE  ORIGINAL 

g:  ACCOUNT  NO  »T  ■$£  .  : 

US$3.  109.90  sr  a; 

?™i£&2Ki*5E  ^•faF*-fe^3^.5E35R.*fese5^  *i_*3i*^  K 

25.75  3,109.90  100000705997  iTte.Kt  TDOlfc. J  ,  . 

5.15  3,130.50  ■ 


**** 


iit 


«*  *#*$.. . 


+  *XlW****2MWh*****.&*  N?  P1702G1 

ICBC  BJ  BR  .FOREIGN  CURRENCY  TIME  DEPOSIT  SAVIN GjjRECEIPT 


(d) 


Figure  16:  Degraded  form  documents  and  the  corresponding  line  detection  results.  Red 
lines  drawn  on  original  images  indicate  the  detected  lines,  (a)  About  50%  of  the  pixels 
of  the  lines  are  flipped,  (c)  about  90%  of  pixels  of  the  lines  are  flipped,  (b)  and  (d)  are 
corresponding  line  detection  results  to  (a)  and  (c). 


gaps,  the  probability  of  the  input  form  sample  belonging  to  the  model  A  is  approximated 
as 

clow  =  (39) 

In  the  above  equation,  we  omit  the  observations  of  non-line  states.  The  model  with  the 
highest  probability  is  selected  as  the  final  form  identification  result. 

We  test  the  proposed  method  on  the  NIST  Structured  Forms  Reference  Set,  NIST 
Special  Database  2  [64],  The  data  set  consists  of  5,590  pages  of  binary,  black-and-white 
images  of  synthesized  documents.  The  documents  in  this  database  are  12  different  tax 
forms  from  the  IRS  1040  Package  X  for  the  year  1988.  These  include  Forms  1040,  2106, 
2441,  4562,  and  6251  together  with  Schedules  A,  B,  C,  D,  E,  F,  and  SE.  Eight  of  these 
forms  contain  two  pages  or  form  faces  for  a  total  of  20  different  form  faces  represented 
in  the  database.  The  number  of  samples  of  each  form  face  varies  from  59  to  900.  The 
first  20  samples  of  each  form  face  are  used  for  training  and  the  rest  for  testing.  The  form 
identification  results  are  perfect  with  an  accuracy  of  100%. 

3.5  Summary  and  Future  Work 

In  this  chapter,  we  applied  our  general  model-based  line  detection  algorithm  to  known 
form  processing.  There  are  two  tasks  in  known  form  processing:  form  identification  and 
form  registration.  These  two  tasks  can  be  solved  in  one  unified  framework  by  extending 
our  model-based  line  detection  algorithm.  Our  approach  is  robust  under  a  wide  range 
of  scanning  resolutions,  binarization  thresholds,  and  synthesized  degradation  levels,  as 
demonstrated  experimentally.  A  further  improvement  of  the  proposed  work  may  use 
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the  Gaussian  mixture  distributions  or  the  exponential  distribution  to  replace  the  simple 
Gaussian  distribution  to  model  the  observations  of  non-line  states. 
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4  Handwriting  Identification  in  Noisy  Document  Images 
4.1  Introduction 

Handwriting  often  combines  with  machine  printed  text.  Handwriting  in  a  machine 
printed  document  often  indicates  corrections,  additions,  or  other  supplemental  infor¬ 
mation  that  should  be  treated  differently  from  the  main  content.  The  segmentation  and 
recognition  techniques  requested  for  machine  printed  and  handwritten  text  differ  signifi¬ 
cantly.  Therefore,  identification  of  handwriting  from  machine  printed  text  is  important 
for  the  following  document  image  analysis. 

Handwriting/ machine  printed  text  discrimination  can  be  performed  at  different  levels, 
such  as  the  text  line  [17-20],  word  [21],  or  character  level  [22,23].  Special  consideration 
must  be  given  to  the  size  of  the  region  being  segmented  before  performing  any  classi¬ 
fication.  We  call  the  smallest  unit  for  classification  a  pattern  unit.  If  the  unit  is  too 
small,  the  information  contained  in  it  may  not  be  sufficient  for  classification.  If  it  is 
too  large,  however,  different  types  of  components  may  be  mixed  in  the  same  region.  In 
previous  work  [23]  we  conducted  a  performance  evaluation  for  the  classification  accuracy 
of  machine  printed  text  and  handwriting  at  the  character,  word,  and  zone  levels.  Exper¬ 
iments  show  that  a  reliable  classification  can  be  achieved  at  the  word  level.  We  therefore 
segment  images  at  the  the  word  level,  then  perform  classification. 

The  data  set  we  are  processing  is  noisy,  which  makes  for  a  more  challenging  prob¬ 
lem.  Most  document  enhancement  algorithms  can  remove  large  noise  components  (e.g., 
marginal  black  strips)  with  some  simple  rules  [15,16],  and  small  noise  components  (e.g., 
pepper-and-salt  noise)  with  morphological  operations.  However,  noise  components  with 
a  compatible  size  to  printed  words  cannot  be  easily  removed.  In  our  approach,  we  treat 
noise  as  a  distinguished  class  and  model  it  based  on  selected  features.  We  treat  the 
problem  as  a  three-class  (machine  printed  text,  handwriting,  and  noise)  identification 
problem. 

In  practice  mis-classification  happens  in  an  overlapping  feature  space.  This  holds  es¬ 
pecially  true  for  handwriting  and  noise.  To  deal  with  this  problem,  we  exploit  contextual 
information  in  post-processing  and  refine  the  classification.  Contextual  information  helps 
improve  classification  accuracy.  Many  OCR  systems  use  it,  and  its  effectiveness  has  been 
demonstrated  in  previous  work  [65,66].  The  key  is  to  model  the  statistical  dependency 
among  neighboring  components.  An  OCR  system  outputs  a  text  stream  that  is  one¬ 
dimensional.  Therefore,  an  N-gram  language  model,  based  on  an  Nth  order  1-D  Markov 
chain,  effectively  models  the  context.  With  assistance  from  a  dictionary,  the  N-gram 
approach  can  correct  most  recognition  errors.  Images,  however,  are  two-dimensional. 
Generally,  2-D  signals  are  not  causal,  and  it  is  much  harder  to  model  the  dependency 
among  neighboring  components  in  an  image.  Among  the  image  models  studied  so  far, 
Markov  random  fields  (MRF)  have  been  widely  studied  and  successfully  used  in  many 
applications  [67].  MRFs  are  suitable  for  image  analysis  because  the  local  statistical  de¬ 
pendency  of  an  image  can  be  well  modeled  by  Markov  properties.  MRFs  can  incorporate 
a  priori  contextual  information  or  constraints  in  a  quantitative  way.  The  MRF  model  has 
been  extensively  used  in  various  image  analysis  applications,  such  as  texture  synthesis 
and  segmentation,  edge  detection,  and  image  restoration  [68,69].  We  use  MRFs  to  model 
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the  dependency  of  segmented  neighboring  blocks.  As  post-processing,  MRFs  can  further 
improve  classification  accuracy. 

The  proposed  method  is  not  limited  to  extracting  handwriting  from  a  heterogeneous 
document.  After  classification,  we  can  output  different  contents  into  different  layers. 
By  separating  noise,  the  layer  of  machine  printed  text  is  much  cleaner  than  the  original 
noisy  document.  Our  approach  can  be  used  as  a  document  enhancement  procedure, 
which  facilitates  the  further  document  image  analysis  tasks,  such  as  zone  segmentation 
and  OCR.  In  this  chapter,  we  demonstrate  the  effectiveness  of  our  approach  on  zone 
segmentation. 

The  remainder  of  this  chapter  is  organized  as  follows.  In  Section  4.2,  we  briefly  review 
the  previous  work  on  hand  writ  ing/machine  printed  text  identification.  Noise  identifica¬ 
tion  and  removal  also  relates  to  our  work  and  is  reviewed.  We  present  the  detailed 
description  of  our  classification  method  in  Section  4.3.  MRF  based  post-processing  is 
discussed  in  Section  4.4,  and  the  experimental  results  are  presented  in  Section  4.5.  The 
effectiveness  of  our  approach  for  document  enhancement  is  demonstrated  in  Section  4.6 
with  the  application  of  zone  segmentation.  The  chapter  concludes  with  a  brief  summary 
and  a  discussion  of  future  work  in  Section  4.7. 

4.2  Related  Work 

Some  work  has  been  accomplished  on  handwriting/machine  printed  text  identification. 
The  classification  is  typically  performed  at  the  text  line  [17-20],  word  [21],  or  character 
level  [22,70].  At  the  line  level,  machine  printed  text  lines  are  typically  arranged  regularly 
with  a  straight  baseline,  while  handwritten  text  lines  are  irregular  with  a  varying  base¬ 
line.  Srihari  et  al.  [20]  implemented  a  text  line  based  approach  using  this  characteristic 
and  achieved  a  classification  accuracy  of  95%.  One  advantage  of  this  approach  is  that 
it  can  be  used  in  different  scripts  (Chinese,  English,  etc.)  with  little  or  no  modifica¬ 
tion.  Guo  et  al.  [21]  proposed  an  approach  based  on  the  vertical  projection  profile  of 
the  segmented  words.  They  used  a  hidden  Markov  model  (HMM)  as  the  classifier  and 
achieved  a  classification  accuracy  of  97.2%.  Although  less  information  is  available  at  the 
character  level,  humans  can  still  identify  the  handwritten  and  machine  printed  characters 
easily,  inspiring  researchers  to  pursue  classification  at  the  character  level.  Kuhnke  [22] 
proposed  a  neural  network-based  approach  with  straightness  and  symmetry  as  features. 
Zheng  et  al.  [70]  used  run-length  histogram  features  to  identify  handwritten  and  printed 
Chinese  characters  and  achieved  promising  results.  In  previous  work,  we  implemented  a 
handwriting  identification  method  based  on  several  categories  of  features  and  a  trained 
Fisher  linear  discriminant  classifier  [23].  However,  the  problems  introduced  by  noise  are 
not  addressed. 

Since  our  approach  can  be  seen  as  a  document  enhancement  technique,  the  work  on 
noise  removal  also  relates  to  our  work.  Noise  may  be  introduced  in  document  images 
through  (1)  physical  degradation  of  the  hard-copy  documents  during  creation,  and/or 
storage,  and  (2)  the  digitization  procedure,  such  as  scanning.  If  severe  enough,  ei¬ 
ther  of  them  can  reduce  the  performance  of  a  document  analysis  system  significantly. 
Several  document  degradation  models  [63,  71,  72],  methods  for  document  quality  as¬ 
sessment  [73,74],  and  document  enhancement  algorithms  [75-77]  have  been  presented 
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in  previous  work.  One  common  enhancement  approach  is  window-based  morphologi¬ 
cal  filtering  [75-77].  Morphological  filtering  performs  a  table  looking-up  procedure  to 
determine  an  output  of  ON  (black  pixel)  or  OFF  (white  pixel)  for  each  entry  of  the 
table,  based  on  a  windowed  observation  of  its  neighbors.  These  algorithms  can  be  fur¬ 
ther  categorized  as  manually  designed,  semi-manually  designed,  or  automatically  trained 
approaches.  The  kFill  algorithm,  proposed  by  O’Gorman  [77],  is  a  manually  designed 
approach  and  has  been  used  by  several  other  researchers  [73,78].  Experiments  show  it  is 
effective  for  removing  salt-and-pepper  noise.  Liang  et  al.  [79]  proposed  a  semi-manually 
designed  approach  with  a  3  x  3  window  size.  They  manually  determine  some  entries  to 
output  ON  or  OFF  based  on  a  priori  observations.  The  remaining  entries  are  trained  to 
select  the  optimal  output. 

ft  is  difficult  to  manually  design  a  filter  with  a  large  window  size,  and  success  depends 
on  experience.  If  both  ideal  and  degraded  images  are  available,  optimal  filters  can  be  de¬ 
signed  by  training  [76].  After  registering  the  ideal  and  degraded  images  at  the  pixel  level, 
an  optimal  look-up  table  can  be  designed,  based  on  observation  of  the  outputs  of  each 
specific  windowed  context.  However,  it  is  difficult  to  train,  store,  and  retrieve  the  look-up 
table  when  the  window  size  is  large.  This  approach  requires  both  the  original  and  the 
corresponding  degraded  images  for  training.  Loce  [76]  used  artificially  degraded  images 
generated  by  models  for  training,  while  Kanungo  et  al.  [80-82]  proposed  methods  for 
validation  and  parameter  estimation  of  degradation  models.  Though  the  uniformity  and 
sensitivity  of  their  approach  has  been  tested  by  other  researchers  [72,83],  no  degradation 
model  has  been  declared  to  pass  the  validation.  Another  problem  with  morphological 
approaches  is  the  small  window  sizes.  The  most  commonly  used  window  size  is  no  larger 
than  5x5,  which  is  too  small  to  contain  enough  information  for  enhancement. 

The  above  approaches  only  identify  and  remove  small-sized  noise  components.  The  re¬ 
moval  of  large-sized  noise  components  is  also  addressed  in  the  literature,  such  as  marginal 
noise  removal  [84]  and  show-through  removal  [85,86].  It  is  hard  to  discriminate  noise  from 
compatible  sized  text.  In  this  dissertation,  we  treat  noise  as  a  distinguished  class  and 
use  a  classification  based  approach. 

4.3  Text  Identification 

In  this  section  we  present  our  text  (machine  printed  or  handwritten)  extraction  and 
classification  method. 

4.3.1  Feature  Extraction 

Several  sets  of  features  are  extracted  for  classification.  Table  5  lists  the  descriptions  and 
sizes  of  the  feature  sets.  Machine  printed  text,  handwriting,  and  noise  have  different 
visual  appearances  and  physical  structures.  Structural  features  are  extracted  to  reflect 
these  differences.  Gabor  filter  features  and  run-length  histogram  features  can  capture 
the  difference  in  stroke  orientation  and  stroke  length  between  handwriting  and  printed 
text.  Compared  with  text,  noise  blocks  often  have  a  simple  stroke  complexity.  Therefore, 
crossing-count  histogram  features  are  exploited  to  model  such  differences.  We  aslo  take 
regions  of  machine  printed  text,  handwriting,  and  noise  blocks  as  different  textures. 
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Table  5:  Features  used  for  machine  printed  text /handwriting/ noise  classification 


Feature  set 

Feature  description 

of  features 

of  features  selected 

Structural 

Region  size,  connected  components 

18 

9 

Gabor  filter 

Stroke  orientation 

16 

4 

Run-length  histogram 

Stroke  length 

20 

5 

Crossing-count  histogram 

Stroke  complexity 

10 

6 

Bi-level  co-occurrence 

Texture 

16 

2 

2x2  gram 

Texture 

60 

5 

Total 

140 

31 

Figure  17:  Illustration  of  feature  extraction,  (a)  The  overlap  area  of  the  connected 
components  inside  a  pattern  unit  is  extracted  as  a  structural  feature,  (b)  Run-length 
histogram  features,  (c)  Crossing-count  features.  The  crossing  counts  of  the  top  and 
bottom  horizontal  scan  lines  are  1  and  2,  respectively,  (d)  Bi-level  2x2  gram  features. 
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Two  sets  of  bi-level  texture  features  (bi-level  co-occurrence  features  and  bi-level  2x2 
gram  features)  are  used  for  classification.  In  the  following  subsections  we  present  these 
features  in  detail. 


4.3.2  Structural  Features 


We  extract  two  sets  of  structural  features.  The  first  set  includes  features  related  to  the 
physical  size  of  the  blocks,  such  as  density  of  black  pixels,  width,  height,  aspect  ratio, 
and  area.  Suppose  the  image  of  the  block  is  I(x,  y),  0  <  x  <  w,  0  <  y  <  h,  and  w,  h  are 
its  width  and  height,  respectively.  Each  pixel  in  the  block  has  two  values:  0  represents 
background  (a  white  pixel)  and  1  represents  content  (a  black  pixel).  Then  the  density  of 
the  black  pixels  d  is 

w — 1  h—  1 


d 


E  E  i{x,y) 

x=0  y= 0 


W  X  h 


(40) 


The  sizes  of  machine  printed  words  are  more  consistent  than  those  of  handwriting  and 
noise  on  the  same  page.  However,  machine  printed  words  on  different  pages  may  vary 
significantly.  Therefore,  we  use  a  histogram  technique  to  estimate  the  dominant  font 
size  [16],  then  use  the  dominant  font  size  to  normalize  the  width  (to),  height  (h),  aspect 
ratio  (r),  and  area  (a)  of  the  block. 

The  second  set  of  structural  features  is  based  on  the  connected  components  inside 
the  block,  such  as  the  mean  and  variance  of  the  width  (rnw  and  aw),  height  ( rrih  and 
cpj),  aspect  ratio  (mr  and  ay),  and  area  (ma  and  cra )  of  connected  components.  The  sizes 
of  connected  components  within  a  machine  printed  word  are  more  consistent,  leading 
to  smaller  aw  and  oy.  For  a  handwritten  word  or  noise  block,  the  bounding  boxes  of 
the  connected  components  tend  to  overlap  with  each  other,  as  shown  in  Fig.  17a.  For 
machine  printed  English  words,  however,  each  character  forms  a  connected  component 
not  overlapping  with  others.  The  overlapping  area  (the  sum  of  the  areas  of  the  gray 
rectangles  in  Fig.  17a)  normalized  by  the  total  area  of  the  block  is  calculated  as  a  feature. 
We  also  use  the  variance  of  the  vertical  projection  as  a  feature.  In  a  machine  printed 
text  block,  the  vertical  projection  profile  has  obvious  valleys  and  peaks  since  neighboring 
characters  do  not  touch  each  other.  However,  for  a  handwritten  word  or  noise  block,  the 
vertical  projections  are  much  smoother,  resulting  in  smaller  variance. 


4.3.3  Gabor  Filter  Features 


Gabor  filters  can  represent  signals  in  both  the  frequency  and  time  domains  with  minimum 
uncertainty  [87]  and  have  been  widely  used  for  texture  analysis  and  segmentation  [88]. 
Researchers  found  that  they  match  the  mammalian  visual  system  very  well,  which  pro¬ 
vides  further  evidence  that  we  can  use  it  in  classification  tasks.  In  the  spatial  and 
frequency  domains,  the  two-dimensional  Gabor  filter  is  defined  as 


g(x,y)  =  exp 


—  7T 


x'2  y 12 
“w  H - v 


L  Gc 


<7: 


y 


x  cos{27r(u0a;  +  v0y )} 


(41) 
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G(u,v)  =  2naxay(exp{—7:[(u'  -  u'0)2(t2x  +  (v'  -  )2cr^]}  + 

exp{-n[(u'  +  u0)2a2x  +  (v'  +  v’0)2a2}})  (42) 

where  x'  =  —x  sin  9  +  y  cos  9,  y'  =  —x  cos  9  —  y  sin  9,  u!  —  u  sin  9  —  v  cos  9,  v'  =  —u  cos  9  — 
v  sin  9 ,  Mg  =  —  u0  sin  9  +  v0  cos  9 ,  v'0  =  —  u0  cos  9  —  v0  sin  9,  u0  —  f  cos  9 ,  and  v0  =  f  sin  9. 
Here  /  and  9  are  two  parameters,  representing  the  central  frequency  and  orientation  of 
the  Gabor  filter. 

The  variances  of  the  filtered  images  are  taken  as  features.  In  our  experiments  16 
Gabor  filters  with  different  orientations  9k  —  k  x  180/IV,  k  —  1,2, ...  16,  are  used,  which 
generate  16  features. 

4.3.4  Run- length  Histogram  Features 

Run-length  histogram  features  are  proposed  in  [23]  for  machine  printed/handwritten 
Chinese  character  classification.  These  features  are  used  in  our  case  to  capture  the 
difference  between  the  stroke  lengths  of  machine  printed  text,  handwriting,  and  noise 
blocks.  First,  black  pixel  run-lengths  in  four  directions,  including  horizontal,  vertical, 
major  diagonal,  and  minor  diagonal,  are  extracted.  We  then  calculate  four  histograms 
of  run-lengths  for  these  four  directions,  as  shown  in  Fig.  17b.  To  get  scale-invariant 
features,  we  normalize  the  histograms.  Suppose  C*,,  k  —  1,  2, ...,  N,  is  the  number  of  runs 
with  length  k,  and  N  is  the  maximal  length  of  all  possible  runs,  then  the  normalized 
histogram  C'k  is 

C’k  =  -W-  (43) 

EC, 

i= 1 

We  then  divide  the  histogram  into  five  bins  with  equal  width  and  use  five  Gaussian-shaped 
weight  windows  to  get  the  final  features  (Fig.  17b).  Taking  the  horizontal  run-length 
histogram  as  an  example,  the  run-length  histogram  feature  Rhi  is  calculated  as 

W 

Rhi  —  (j(fc;  Uj,  cr)C'k ,  i  =  1,2, 3, 4, 5  (44) 

k= 1 

where  w  is  the  width  of  the  block  (the  maximal  length  of  all  possible  horizontal  run- 
lengths)  and  G(k]Ui,a )  is  a  Gaussian-shaped  function: 

G(k]Ui,a)  =  expS^-^k  j  (45) 

As  shown  in  Fig.  17b,  o  is  chosen  so  the  weight  on  each  bin  border  is  0.5.  Another 
alternative  is  to  use  rectangular  windows  without  overlap  between  neighboring  bins. 
Experiments  show  that  the  extracted  features  with  Gaussian  weighted  windows  are  more 
robust.  Five  features  are  extracted  in  each  direction,  leading  to  20  features. 

4.3.5  Crossing-Count  Histogram  Features 

A  crossing  count  is  the  number  of  times  the  pixel  value  changes  from  0  (white  pixel)  to 
1  (black  pixel)  along  a  horizontal  or  vertical  raster  scan  line.  As  shown  in  Fig.  17c,  the 
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crossing  counts  of  the  top  and  bottom  horizontal  scan  lines  are  1  and  2,  respectively. 
Crossing  counts  can  be  used  to  measure  stroke  complexity  [23,89].  In  our  approach,  first 
the  crossing  count  for  each  horizontal  and  vertical  scan  line  is  calculated.  Similarly,  we 
get  two  histograms  for  the  horizontal  and  vertical  crossing  counts  respectively.  The  same 
technique  (as  in  extracting  the  run-length  histogram  features)  is  exploited  to  get  the  final 
features  from  the  histograms.  A  total  of  10  features  are  extracted. 

4.3.6  Bi-level  Co-occurrence  Features 

A  co-occurrence  count  is  the  number  of  times  a  given  pair  of  pixels  occurs  at  a  fixed 
distance  and  orientation  [90].  In  the  case  of  binary  images,  the  possible  co-occurrence 
pairs  are  white-white,  black-white,  white-black,  and  black-black.  In  our  case,  we  are 
concerned  primarily  with  the  foreground.  Since  the  white  background  region  often  ac¬ 
counts  for  up  to  80%  of  a  document  page,  the  occurrence  frequency  of  white-white  or 
white-black  pixel  pairs  will  always  be  much  higher  than  that  of  black-black  pairs.  The 
black-black  pairs  carry  most  of  the  information.  To  eliminate  the  redundancy  and  re¬ 
duce  the  effects  of  over-emphasizing  the  background,  we  consider  only  black-black  pairs. 
Four  different  orientations  (horizontal,  vertical,  major  diagonal  and  minor  diagonal)  and 
four  distance  levels  (1,  2,  4,  and  8  pixels)  are  used  to  classify  (16  features  total).  The 
horizontal  co-occurrence  count  Ch(d),  for  example,  is  defined  as 

Ch(d)  =  ^2^2l(x,y)I(x  +  d,y),d  =  1,2, 4, 8  (46) 

x  y 

I(x,y )  =  0  for  white  pixels;  therefore  only  black-black  pixel  pairs  contribute.  For  a  fixed 
distance  d,  we  normalize  the  occurrence  by  dividing  by  the  sum  of  the  occurrences  in  all 
four  directions. 

4.3.7  Bi-level  2x2  gram  Features 

The  NxM  grams  were  first  introduced  in  the  context  of  image  classification  and  re¬ 
trieval  [91].  An  NxM  gram  extends  the  one-dimensional  co-occurrence  feature  to  the 
two-dimensional  case.  We  only  consider  2x2  grams,  which  count  the  numbers  of  oc¬ 
currences  of  the  patterns  shown  in  Fig.  17d.  The  cells  labeled  0/1  should  take  specific 
values,  and  the  values  of  other  cells  are  irrelevant.  Therefore,  there  are  24  =  16  patterns 
for  each  distance  d.  Like  the  co-occurrence  features,  the  all  white  patterns  are  removed 
to  reduce  over-emphasis  on  the  background.  For  a  fixed  distance,  the  occurrences  are 
normalized  by  dividing  by  the  sum  of  all  occurrences.  Four  distances  (1,  2,  4,  and  8 
pixels)  are  chosen,  generating  4  x  15  =  60  features. 

4.3.8  Feature  Selection 

There  are  two  purposes  for  feature  selection.  The  first  involves  reducing  the  computation 
needed  for  feature  extraction  and  classification.  As  shown  in  Table  5,  we  extract  a  total  of 
140  features  from  the  segmented  blocks.  Though  these  features  are  designed  to  distinguish 
between  different  types  of  blocks,  some  features  may  contain  more  information.  Using 
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only  a  small  set  of  the  most  powerful  features  reduces  the  time  for  feature  extraction  and 
classification.  The  second  purpose  is  to  alleviate  the  curse  of  dimensionality  problem. 
When  the  number  of  training  samples  is  limited,  using  a  large  feature  set  may  decrease 
the  generality  of  a  classifier  [92],  The  larger  the  feature  set,  the  more  training  samples 
are  needed.  Therefore,  we  perform  feature  selection  before  feeding  the  features  to  the 
classifier. 

We  use  a  forward  search  algorithm  to  perform  feature  selection  [93].  We  first  divide 
the  whole  feature  set  T  into  a  currently  selected  feature  set  and  an  un-selected  feature 
set  „  which  satisfy 


njFn  =  <h 
Ts  U  Tn  =  T 

The  selection  procedure  can  then  be  described  as 

1.  Set  J-’s  =  <h,  and  J-n  =  T . 

2.  Label  all  features  in  Tn  as  un-tested. 

3.  Select  one  un-tested  feature  /  G  J-»  and  label  it  as  tested. 

4.  Put  /  and  together  and  generate  a  temporary  selected  feature  set  jFsf. 

5.  Estimate  the  classification  accuracy  with  feature  set  T\  using  a  1-NN  classifier 
and  leave-one-out  cross  validation  technique.  Basically,  at  each  iteration  only  one 
sample  is  used  for  testing,  while  the  others  are  used  for  training.  We  repeat  this 
process  until  all  samples  have  been  used  as  testing  samples  once.  The  average 
accuracy  for  all  iterations  is  taken  as  the  estimated  accuracy  for  the  current  feature 
set.  The  leave-one-out  cross  validation  technique  can  estimate  the  accuracy  of  a 
classifier  with  small  variation  [92], 

6.  If  there  are  un-tested  features  in  JPn,  go  to  step  3. 

7.  Find  a  feature  /  G  JPn,  such  that  the  corresponding  temporary  feature  set  Jy  has 
the  highest  classification  accuracy: 

f  —  arg  max  Accuracy  (^y)  (49) 


(47) 

(48) 


then  move  /  from  to  JFS. 

8.  If  J-n  ^  <f>,  go  to  step  2;  otherwise  exit. 

We  use  LNKnet  pattern  classification  software  to  conduct  our  feature  selection  experi¬ 
ments  [94],  LNKnet  provides  several  classifiers,  such  as  likelihood  classifiers,  k-NN  clas¬ 
sifiers,  and  neural  network  classifiers,  and  several  feature  selection  algorithms  such  as 
forward  search,  backward  search,  and  forward  and  backward  search.  Feature  selection 
can  be  an  extremely  expensive  task.  Considering  the  large  number  of  feature  sets  to 
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(b) 


Figure  18:  Feature  analysis,  (a)  Feature  selection:  the  best  classification  result  is  achieved 
when  31  features  are  selected,  (b)  PCA:  the  best  classification  result  is  achieved  when 
64  principal  components  are  used. 


evaluate,  and  the  number  of  classifiers  to  train,  the  lightweight  forward  feature  selec¬ 
tion  algorithm  and  1-NN  classifier,  which  does  not  need  training,  are  used  in  our  feature 
selection  experiment. 

We  collected  about  1,500  blocks  for  each  class.  As  shown  in  Fig.  18a,  when  the  number 
of  selected  features  increases,  the  error  rate  decreases  sharply  at  first.  The  trend  reverses 
at  some  point.  The  best  classification  is  achieved  when  only  31  features  are  selected,  with 
an  error  rate  of  5.7%.  When  all  features  are  used,  the  error  rate  increases  to  9.2%  due  to 
the  limited  number  of  training  samples  and  large  feature  set.  The  last  column  in  Table  5 
lists  the  number  of  features  selected  in  each  set.  It  shows  that  texture  features,  such  as 
bi-lcvcl  co-occurrence  and  2x2  grams,  are  less  discriminating  than  other  feature  sets, 
mainly  because  of  the  small  region  size.  Only  1/8  of  the  bi-level  co-occurrence  features 
and  1/12  of  the  2x2  gram  features  are  selected.  Crossing-count  histogram  features  and 
structural  features  are  more  effective,  with  more  than  half  of  the  original  features  in  both 
sets  selected  in  the  final  feature  set. 

Principal  component  analysis  (PCA)  is  another  technique  for  reducing  feature  dimen¬ 
sion  [92],  To  extract  the  first  n  principal  components,  we  need  to  search  a  subspace  of 
dimension  n  with  basis  w.  Suppose  the  mean  is  already  removed  from  the  feature  vector 
X,  and  let  the  projection  of  X  onto  this  subspace  be  X 

X  =  (wf  X)wi  +  (wj  X)w2  +  . . .  +  (w^X)wn  (50) 

PCA  finds  the  optimal  subspace  w  such  that  the  energy  contained  in  X  is  maximized: 


w  =  arg  max  >  Var 

Wl,..,W n 

2—1 


X, 


s.t.  w;  w j  = 


1  if  i—j 
0  if  t  ^  j 


(51) 


The  optimal  basis  is  the  first  n  eigenvectors  of  the  covariance  matrix  of  X,  corresponding 
to  the  first  n  eigenvalues  [92] .  The  first  n  principal  components  are  P*  =  wj  X,  i  — 
1, . . . ,  n.  The  idea  of  PCA  is  to  concentrate  the  energy  into  the  first  several  principal 
components.  Assuming  the  classification  information  is  contained  in  the  energy,  the 
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first  several  principal  components  are  more  powerful  than  the  remaining  components. 
Furthermore,  PCA  analysis  can  remove  the  correlation  among  features. 

As  in  the  feature  selection  experiment,  we  use  the  1-NN  classifier  and  the  leave-one- 
out  technique  to  estimate  the  classification  accuracy.  Fig.  18b  shows  the  classification 
error  rate  versus  the  number  of  principal  components  used.  As  in  feature  selection,  the 
error  rate  reduces  quickly  at  first  until  16  principal  components  added.  The  minimal 
error  rate,  8.5%,  is  achieved  when  64  principal  components  are  used.  Compared  with 
the  minimum  error  rate  of  5.7%  achieved  by  the  feature  selection  technique,  PCA  is 
not  as  powerful  as  feature  selection  in  this  problem.  Furthermore,  to  perform  PCA,  all 
features  must  be  extracted  first.  However,  for  feature  selection,  we  only  need  to  extract 
the  desired  features,  which  would  increase  the  feature  extraction  speed.  Therefore,  in  the 
following,  we  do  classification  on  the  31  selected  features. 


4.3.9  Classification 

Compared  with  the  neural  network  (NN)  and  the  support  vector  machine  (SVM),  the 
Fisher  linear  discriminant  classifier  is  easier  to  train,  faster  to  classify,  needs  fewer  train¬ 
ing  samples,  and  does  not  suffer  from  the  over-training  problems.  According  to  the 
comparison  experiment  in  Subsection  4.5.2,  the  SVM  classifier  performs  slightly  better 
than  the  Fisher  linear  discriminant  classifier,  but  the  latter  is  much  faster.  We  therefore 
use  it  for  classification.  For  a  feature  vector  X,  the  Fisher  linear  discriminant  classifier 
projects  X  onto  one  dimension  Y  in  direction  W 

Y  =  WTX  (52) 


The  Fisher  criterion  finds  the  optimal  projection  direction  W„  by  maximizing  the  ratio 
of  the  between-class  scatter  to  the  within-class  scatter,  which  benefits  the  classification. 
Let  Sw  and  S&  be  the  within-  and  between-class  scatter  matrices  respectively, 


I\ 

s„,  =  y  y  (x  -  ufc)(x  -  uk)T  (53) 

k=1  xeclass  k 


K 

Sb  =  -  U0)(ufc  -  u0)T 

k=l 


k= 1 


(54) 

(55) 


where  is  the  mean  vector  of  the  kth  class,  u0  is  the  global  mean  vector,  and  K 
is  the  number  of  classes.  The  optimal  projection  direction  is  the  eigenvector  of  S“1S 5 
corresponding  to  its  largest  eigenvalue  [92],  For  a  two-class  classification  problem,  we  do 
not  need  to  calculate  the  eigenvectors  of  S“1Sf).  It  is  shown  that  the  optimal  projection 
direction  is 


W0  =  Sw1(u1  -  u2) 


(56) 
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Let  Yi  and  Y2  be  the  projections  of  two  classes  and  let  E[Lj]  and  E [Y2]  be  the  means  of 
Yi  and  Y2  respectively.  Suppose  E[Li]  >  E[F2],  then  the  decision  can  be  made  as 


C'(X)  = 


class  1 
class  2 


If  Y  >  (E[l'j]  +  E[F2])/2 
Otherwise 


(57) 


It  is  known  that  if  the  feature  vector  X  is  jointly  Gaussian  distributed,  and  the  two 
classes  have  the  same  covariance  matrices,  then  the  Fisher  linear  discriminant  classifier 
is  optimal  in  a  minimum  classification  error  sense  [92], 

The  Fisher  linear  discriminant  classifier  is  often  used  for  two-class  classification  prob¬ 
lems.  Although  it  can  be  extended  to  multi-class  classification  (three  classes  here),  the 
classification  accuracy  decreases  due  to  the  overlap  between  neighboring  classes.  There¬ 
fore,  we  use  three  Fisher  linear  discriminant  classifiers,  each  optimized  for  a  two-class 
classification  problem  (machine  printed  text/handwriting,  machine  printed  text/noise, 
and  handwriting/noise).  Each  classifier  outputs  a  classification  confidence,  and  the  final 
decision  is  made  by  fusing  the  outputs  of  all  three  classifiers. 


4.3.10  Classification  Confidence 

In  a  Fisher  linear  discriminant  classifier,  the  feature  vector  is  projected  onto  an  axis  on 
which  the  ratio  of  between-class  scatter  to  within-class  scatter  is  maximized.  According 
to  the  central  limit  theorem  [51],  the  distribution  of  the  projection  can  be  approximated 
by  a  Gaussian  distribution,  if  no  feature  has  dominant  variance  over  the  others, 


fy{y) 


1 


(58) 


where  fy  (y)  is  the  probability  density  function  of  the  projection.  The  parameters  m  and 
cr  can  be  estimated  from  training  samples.  The  classification  confidence  CtJ  of  class  i 
using  classifier  j  is  defined  as 

„  f  — — - 1 — /vMX£(-lass  b  - 1 - -  If  i  is  applicable  for  classifier  j.  ,  . 

Citj  =  <  fY(y ixeclass  *)+jV(dxeanother  class)  ^  J  (59) 

I  0  Otherwise 


where  i  is  the  class  label  and  j  represents  the  trained  classifiers.  If  a  classifier  is  trained 
to  classes  1  and  2,  its  output  is  not  applicable  to  estimating  the  classification  confidence 
of  class  3.  Therefore,  C^j  =  0.  The  final  classification  confidence  is  defined  as 

1  3 

c.  =  2  X  C,J  (60) 

3= 1 

Cij  G  [0, 1]  for  the  two  applicable  classifiers  and  Ch]  =  0  for  the  third  classifier,  Ct  e  [0, 1]. 
However,  C{  is  not  a  good  estimate  of  the  a  posteriori  probability  since  ^3=1  Ci  =  1.5 
instead  of  1.  We  can  take  C*  as  an  estimate  of  a  non-decreasing  function  of  the  a  posteriori 
probability,  which  is  a  kind  of  generalized  classification  confidence  [95]. 
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Fig.  19  shows  the  word  segmentation  and  classification  results  (with  the  Fisher  linear 
discriminant  classifier)  for  the  whole  and  parts  of  a  document  image,  with  blue,  red,  and 
green  representing  machine  printed  text,  handwriting,  and  noise  respectively.  We  can  see 
that  most  of  the  blocks  are  correctly  classified.  However  some  blocks  are  misclassihed 
due  to  an  overlap  in  the  feature  space.  For  example,  some  noise  blocks  are  classified  as 
handwriting  in  Fig.  19b,  and  some  small  printed  words  are  classified  as  noise  in  Fig.  19c. 
Since  little  information  is  available  in  small  areas,  it  is  difficult  to  get  good  results.  In  the 
next  section,  we  present  a  method  of  Markov  random  held  (MRF)  based  post-processing 
to  refine  the  classification  by  incorporating  contextual  information. 

4.4  MRF-Based  Post-Processing 
4.4.1  Background 

Let  X  denote  the  random  held  defined  on  fi,  and  let  T  denote  the  set  of  all  possible 
configurations  of  X  on  fh  X  is  an  MRF  with  respect  to  the  neighborhood  p  if  it  has  the 
following  Markov  property 


Pr(X  =  x)  >  0  for  all  x  G  T  (61) 

P(xs\xr,r  G  fl,r  ^  s)  —  P(xs\xr,r  G  rj)  (62) 

Compared  with  Markov  chains,  one  difficulty  with  MRFs  is  that  they  have  no  chain 
rule.  The  joint  probability  P(X  =  x)  cannot  be  recursively  written  in  terms  of  local 
conditional  probabilities  P(xs\xr,r  G  rf).  Therefore,  it  is  difficult  to  get  an  optimal 
estimate  of  the  MRF  X  which  maximizes  the  a  posteriori  probability 

X  =  argmaxP(X|Y)  (63) 

The  establishment  of  the  connection  between  the  MRF  and  Gibbs  distribution  provides 
a  way  to  optimize  the  MRF.  To  maximize  the  a  posteriori  probability  of  the  MRF,  we 
need  to  minimize  the  total  energy  of  the  corresponding  Gibbs  distribution 

X  =  arg  min  ^  RC(X)  (64) 

cGC 

Here,  a  clique  c  is  defined  as  a  subset  of  sites  in  which  every  pair  of  distinct  sites  are 
neighbors.  The  clique  potential  Vj.(X)  is  the  energy  associated  with  a  clique  and  depends 
on  the  local  configuration  of  clique  c.  Therefore,  the  optimization  problem  (63)  is  con¬ 
verted  to  another  optimization  problem  (64).  The  information  about  the  observation  Y 
is  contained  in  the  clique  system. 

In  the  study  of  MRFs,  the  problems  are  often  posed  as  labeling  problems  in  which  a 
set  of  labels  are  assigned  to  sites  of  an  MRF  [69].  In  our  problem,  each  block  constitutes  a 
site  of  an  MRF.  A  label  (machine  printed  text,  handwriting,  or  noise)  is  assigned  to  each 
block,  and  context  information  (encoded  by  the  MRF  model)  is  used  to  flip  the  labels  so 
that  the  total  energy  of  the  corresponding  Gibbs  distribution  is  minimized.  Relaxation 
algorithms  are  often  used  for  MRF  optimization  [69]. 
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Figure  19:  Word  block  segmentation  and  classification  results,  with  blue,  red,  and  green 
representing  machine  printed  text,  handwriting,  and  noise  respectively,  (a)  A  whole 
document  image,  (b)  and  (c)  Two  parts  of  the  image  in  (a). 
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Figure  20:  Clique  definition,  (a)  Cp  for  horizontally  arranged  machine  printed  words, 
(b)  Cn  for  noise  blocks. 

4.4.2  Clique  Definition 

As  shown  in  (64),  the  MRF  is  totally  determined  by  clique  c  and  clique  potential  UC(X). 
The  design  of  the  clique  and  its  potential  is  crucial,  but  a  systematic  method  is  not  yet 
available.  In  our  case,  machine  printed  text,  handwriting,  and  noise  exhibit  different 
patterns  of  geometric  relationships.  Onr  definition  of  cliques  reflects  these  differences. 

Printed  words  often  form  horizontal  (or  vertical)  text  lines.  Clique  Cp  is  defined  in 
Fig.  20a,  which  models  contextual  constraints  on  neighboring  machine  printed  words. 
We  first  define  the  connection  between  word  blocks  i  and  j.  As  shown  in  Fig.  20a,  Ov  is 
the  vertical  overlap  between  two  blocks,  and  Dh  is  the  horizontal  distance  between  two 
blocks.  The  distance  between  block  i  and  j  is 

D(i,j)  =  \Dh(i,j)  -  Gw\  +  | Hi  -  Hj\  +  \ Chi  ~  Chj\  (65) 

where  Dh(i,j )  is  the  horizontal  distances  between  words  i  and  j,  Gw  is  the  estimated 
average  word  gap  in  the  whole  document,  Hi  and  Hj  are  the  heights  of  blocks  i  and  j 
respectively,  and  Chi  and  Chj  are  the  vertical  centers  of  the  two  blocks.  Two  blocks  are 
connected  if  they  satisfy 

1.  Ov  >  min(Hi,  Hj)/ 2 

2.  0  <  Dh  <  2 Gw 

3.  D(i,j )  <  Tp,  where  Tp  is  a  threshold,  which  is  not  sensitive  to  post-processing. 

After  defining  the  connection  between  two  blocks  we  can  construct  a  graph  in  which 
nodes  represent  blocks  and  edges  link  two  connected  nodes.  If  a  node  is  connected  with 
more  than  one  node  on  one  side  (left  or  right),  we  keep  only  the  edge  with  the  smallest 
distance.  Clique  Cp  can  be  represented  by  nodes  together  with  their  left  and  right 
neighbors.  If  we  cannot  find  neighbors  on  the  left  or/and  right  sides,  the  corresponding 
neighbor  is  set  to  NULL. 

Noise  blocks  exhibit  random  patterns  in  geometric  relationships  and  tend  to  overlap 
or  in  close  proximity.  As  shown  in  Fig.  20b,  the  noise  block  labeled  “Center”  is  overlapped 


50 


with  blocks  1,  2,  3,  and  is  close  to  block  4.  Clique  Cn  is  defined  primarily  for  noise  blocks. 
Similarly,  the  distance  between  two  blocks  is  defined  as 

D(i,j)  =ma  x(Dh(i,j),Dv(i,j))  (66) 

where  Dh{i,j)  =  max(Lj,  Lj)  —  min(i?j,  Rj),  Dv(i,j )  =  max(Tj,  Tj)  —  min(Sj,  Bj),  and  L, 
R ,  T,  B  are  the  left,  right,  top,  and  bottom  coordinates  of  the  corresponding  blocks.  If 
two  blocks  overlap  in  the  horizontal  or  vertical  direction,  then  Dh(i,j )  <  0  or  Dv{i,j )  <  0. 
Blocks  i  and  j  are  connected  if,  and  only  if,  D(i,j )  <  Tn,  where  Tn  is  a  threshold.  If  Tn  is 
too  big,  incorrect  label  flipping  of  noise  and  handwriting  between  two  printed  text  lines 
may  happen.  If  Tn  is  too  small,  the  contextual  constraints  on  the  noise  blocks  cannot  be 
used  fully.  We  set  Tn  as  half  of  the  dominant  character  height  (about  10  pixels  in  our 
experiments).  Each  node,  together  with  all  nodes  connected  to  it,  defines  clique  Cn.  The 
number  of  connected  nodes  may  vary  from  0  to  almost  10,  depending  on  the  size  of  the 
block.  As  an  approximation,  we  consider  only  the  first  four  nearest  connected  neighbors. 
If  the  number  of  neighbors  is  less  than  four,  we  set  the  corresponding  neighbors  to  NULL. 

The  geometric  constraint  on  handwriting  has  weaker  horizontal  or  vertical  structure 
than  machine  printed  words,  thus  it  is  partially  reflected  in  both  cliques  Cp  and  CJn. 
Therefore,  we  do  not  define  a  specific  clique  for  handwriting. 


4.4.3  Clique  Potential 


Clique  potential  is  the  energy  associated  with  a  clique.  We  assign  high  energy  to  an 
undesirable  configuration  of  the  clique  and  low  energy  to  a  preferred  configuration.  For 
example,  an  undesired  configuration  of  clique  Cp  (as  shown  in  Fig.  20a)  is  that  the  left 
and  right  blocks  are  labeled  as  printed  text  and  the  center  block  as  noise.  Flipping 
the  label  of  the  center  block  from  noise  to  printed  text  would  achieve  a  more  preferred 
configuration,  and  reduce  the  total  energy.  Another  undesirable  configuration  occurs 
when  all  blocks  are  labeled  as  printed  text  for  the  clique  Cn  in  Fig.  20b.  It  should  have 
higher  energy  than  the  configuration  in  which  all  blocks  are  labeled  as  noise.  In  many 
applications  the  clique  potentials  are  defined  ad  hoc.  One  systematic  way  is  to  define 
clique  potential  as  the  occurrence  frequency  of  each  clique  in  the  training  set,  which  can 
be  expressed  as  a  function  of  local  conditional  probabilities.  Based  on  this  idea,  we  define 
two  clique  potentials  Vp(c)  and  14(c)  for  cliques  Cp  and  Cn  as 

P(Xi,Xc,Xr) 

p[C)  (P(Xl)P(Xc)P(Xr))w  ’ 


14(c) 


(P(XC)P(X1)P(X2)P(X3)P(X4))^ 


(68) 


where  X/,  Xc  and  Xr  are  labels  for  the  left,  center,  and  right  blocks  of  clique  c,  w  is  a 
constant,  and  Xl}  i  =  1,2,  3, 4,  is  the  label  of  the  ith  nearest  block.  The  energy  of  the 
corresponding  Gibbs  distribution  is 


U (X|Y)  =  ^[-P(zs|r/s)]  +  wp  Vp(c)  +  (69) 

c£.Cp  CGCyx 
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where  ws,  wp,  and  wn  are  weights  which  adjust  the  relative  importance  of  classification 
confidence  and  contextual  information  for  cliques  Cp  and  Cn.  If  ws  —  1,  wp  —  0,  and 
wn  =  0,  no  contextual  information  is  used;  with  increase  in  wp  and  wn,  more  contex¬ 
tual  information  is  emphasized.  If  we  set  wp  =  wn  =  oo,  or  equivalently  ws  =  0,  no 
classification  confidence  is  used. 

In  the  following  experiments,  we  want  to  use  MRFs  for  word  block  labeling.  The 
number  of  handwritten  words  is  much  smaller  than  the  other  two  types,  leading  to  a  lower 
estimated  frequency  of  cliques  with  handwriting.  As  a  result,  the  optimization  tends  to 
label  handwritten  words  as  machine  printed  text  or  noise.  Therefore,  we  regularize  the 
estimated  clique  frequency  P(Xi,Xc,Xr)  and  P(XC,  Xi,  X2,  X3,  X4)  by  dividing  by  the 
product  of  the  probabilities  of  the  word  block  labels  which  comprise  the  clique.  The 
above  regularization  is  similar  to  the  previous  approach  [96],  where  w  is  set  to  1.  In 
our  case,  w  is  changeable;  increasing  w  will  emphasize  handwritten  words.  Our  clique 
potential  definition  is  systematic  and  can  be  optimized  for  different  applications. 

After  defining  the  cliques  and  the  corresponding  clique  potential,  we  can  search  the 
optimal  configuration  of  the  labels  of  all  blocks,  so  the  total  energy  of  the  correspond¬ 
ing  Gibbs  distribution  is  minimized.  Relaxation  algorithms  are  often  used  for  MRF 
optimization.  There  are  two  types  of  relaxation  algorithms:  stochastic  and  determin¬ 
istic  [69].  Stochastic  algorithms  can  always  converge  to  the  global  optimal  solution  if 
some  constraints  are  satisfied.  They  are,  however,  computationally  demanding.  Deter¬ 
ministic  algorithms  are  simpler,  but  only  converge  to  local  optimal  solutions  depending 
on  the  initial  value.  In  our  experiments,  highest  confidence  first  (HCF),  a  deterministic 
approach,  is  used  for  MRF  optimization  clue  to  its  fast  speed  and  good  performance  [97]. 
In  each  iteration  of  the  HCF  algorithm,  only  one  block  is  chosen  to  flip  its  label  such 
that  the  total  energy  reduces  the  largest.  It  repeats  this  procedure  until  no  single  flip¬ 
ping  can  further  reduce  the  total  energy.  Since  each  flipping  would  reduce  the  energy 
and  the  energy  is  bounded  below,  the  HCF  algorithm  converges  in  a  finite  number  of 
steps.  Fig.  21  is  an  example  of  the  refined  classification  results  after  post-processing. 
Compared  with  Fig.  19,  we  can  see  in  Figs.  21a  and  (b)  that  most  misclassihed  noise 
blocks  are  corrected,  with  a  few  exceptions  clue  to  their  having  fewer  constraints.  The 
misclassihed  small  machine  printed  words  are  all  corrected  in  Fig.  21c. 

4.5  Experiments 
4.5.1  Data  Set 

We  collected  a  total  of  318  business  letters  from  the  tobacco  industry  litigation  archives. 
These  document  images  are  noisy  with  a  significant  amount  of  handwritten  annotations 
and  signatures,  a  few  logos,  and  no  figures  or  tables.  Currently,  we  identify  three  classes: 
machine  printed  text,  handwriting,  and  noise.  We  used  224  images  for  training  and  the 
remaining  94  for  testing.  There  are  about  1,500  handwritten  words  in  the  training  set. 
Since  much  more  machine  printed  and  noise  blocks  are  present,  we  randomly  selected 
about  the  same  number  of  blocks  of  each  type  for  training.  We  used  accuracy  and 
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Figure  21:  Word  block  classification  results  after  post-processing  with  blue,  red,  and 
green  representing  machine  printed  text,  handwriting,  and  noise  respectively,  (a)  The 
whole  document  image,  (b)  and  (c)  Two  parts  of  the  image  in  (a). 
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Table  6:  Performance  comparison  of  three  different  classifiers:  the  k-NN  classifier,  the 
Fisher  linear  discriminant  classifier,  and  the  SVM  classifier. 


#  of 

blocks 

the  k-NN  classifier 

the  Fisher  classifier 

the  SVM  classifier 

Correct  | 

Accuracy  | 

Var  | 

Correct  | 

Accuracy  | 

Variance  | 

Correct  | 

Accuracy  | 

Var  | 

Printed  text 

1,519 

1,489 

98.0% 

1.4% 

1,473 

97.0% 

1.1% 

1,480 

97.4% 

1.2% 

Handwriting 

1,518 

1,390 

91.6% 

2.3% 

1,410 

92.9% 

2.2% 

1,435 

94.5% 

2.1% 

Noise 

1,512 

1,406 

93.0% 

2.0% 

1,451 

96.0% 

1.5% 

1,453 

96.1% 

1.2% 

Overall 

4,549 

4,285 

94.2% 

1.3% 

4,344 

95.5% 

0.9% 

4,368 

96.0% 

0.9% 

Table  7:  Single  word  block  classification. 


#  of  blocks 

Percentage 

#  of  correctly 
classified  blocks 

#  of  misclassified 
blocks 

Accuracy 

Precision 

Printed  text 

19,227 

66.9% 

18,446 

781 

95.9% 

99.5% 

Handwriting 

701 

2.4% 

653 

48 

93.2% 

62.9% 

Noise 

8,802 

30.7% 

8,522 

280 

96.8% 

93.0% 

Overall 

28,730 

100.0% 

27,621 

1,109 

96.1% 

N/A 

precision  as  metrics  to  evaluate  the  result: 


Accuracy  of  type  i 


44  of  correctly  classified  blocks  of  type  i 
44  of  blocks  of  type  i 


Precision  of  type  i 


44  of  correctly  classified  blocks  of  type  i 
44  of  blocks  classified  as  type  i 


(70) 

(71) 


4.5.2  Classifier  Comparison 

In  this  section,  we  compare  the  performance  of  three  different  classifiers:  the  k-NN  classi¬ 
fier,  the  Fisher  linear  discriminant  classifier,  and  the  SVM  classifier.  The  SVM  classifier 
is  based  on  VC  dimension  theory  and  structural  risk  minimization  theory  of  statistical 
learning  [98].  A  public  domain  SVM  tool,  LibSVM,  is  used  in  the  following  experi¬ 
ment  [99].  The  N-fold  ( N  =  10  in  our  experiment)  verification  technique,  a  variation 
of  the  leave-one-out  technique,  is  used  to  estimate  the  classification  accuracy.  Instead 
of  holding  one  sample  for  testing  at  each  iteration,  it  first  divides  the  data  set  into  N 
groups  ( N  =  10  in  our  experiment),  then  holds  one  group  of  samples  for  testing  and 
the  remaining  groups  for  training.  Table  6  shows  the  classification  accuracies  of  all  the 
classifiers.  We  can  see  that  the  SVM  classifier  achieved  the  highest  accuracy.  Considering 
the  large  variance,  the  improvement  is  not  significant.  The  variance  of  the  classification 
accuracy  of  all  classifiers  is  the  smallest  for  printed  text  and  the  largest  for  handwrit¬ 
ing,  indicating  that  the  printed  text  is  more  compact  in  the  feature  space.  Among  all 
three  classifiers,  the  Fisher  linear  discriminant  classifier  is  the  fastest  since  it  needs  only 
one  vector  multiplication  to  perform  a  classification.  Therefore,  we  use  the  Fisher  linear 
discriminant  classifier  for  the  rest  of  the  experiments. 

The  classification  result  on  the  test  set  of  94  images,  using  the  Fisher  linear  discrimi¬ 
nant  classifier,  is  shown  in  Table  7.  The  accuracies  on  all  three  classes  range  from  93.2% 
to  96.8%,  with  an  overall  accuracy  of  96.1%.  While  this  overall  accuracy  is  high,  we 
notice  that  the  precision  for  handwriting  is  low  (62.9%).  This  is  mainly  because  the 
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(a)  (b)  (c) 

Figure  22:  MRF-based  post-processing,  (a)  Number  of  corrected  blocks  using  clique  Cp. 
(b)  Number  of  corrected  blocks  using  clique  Cn.  (c)  Number  of  corrected  blocks  using 
clique  Cv  and  classification  confidence. 


number  of  handwritten  words  in  the  testing  set  is  small.  Even  small  percentages  of  mis- 
classification  of  machine  printed  text  and  noise  as  handwriting  will  significantly  decrease 
the  precision  of  handwriting. 

4.5.3  Post-processing  Using  MRFs 

In  the  following  experiments  we  investigate  how  MRFs  can  improve  classification  accu¬ 
racy.  In  the  first  run,  we  set  ws  —  0  ,  wn  —  0  and  wp  —  1  to  show  the  effectiveness  of  clique 
Cp.  Fig.  22a  shows  the  number  of  corrected  blocks,  which  were  previously  misclassified, 
with  change  in  w.  As  expected,  Cp  is  effective  for  machine  printed  words,  but  not  as  ef¬ 
fective  for  handwriting  and  noise.  When  w  =  0.3  (under  this  condition,  the  classification 
accuracy  of  all  three  classes  increases),  355  (46%)  of  the  previously  misclassified  machine 
printed  words  are  corrected.  When  w  increases,  handwriting  is  emphasized  more,  leading 
to  higher  classification  accuracy  of  handwriting,  and  lower  accuracy  of  machine  printed 
words  and  noise.  In  practice,  w  can  be  adjusted  to  optimize  the  overall  accuracy. 

In  the  second  run,  we  test  the  effectiveness  of  clique  Cn  by  setting  ws  =  0,  wp  =  0, 
and  wn  =  1.  As  shown  in  Fig.  22b,  clique  Cn  effectively  corrects  classification  errors  of 
noise  blocks.  The  classification  error  of  noise  blocks  is  greatly  reduced  when  w  is  small. 
For  w  =  0.6  (under  this  condition,  the  classification  accuracy  of  all  classes  increases),  the 
number  of  misclassified  noise  blocks  is  reduced  by  99  (35%).  Cn  can  also  correct  some 
classification  errors  of  machine  printed  words,  but  is  less  effective  than  Cp  as  shown  in 
Fig.  22a. 

The  third  run  tests  the  effectiveness  of  classification  confidence  for  post-processing. 
Fig.  22c  shows  post-processing  results  by  adjusting  wp  when  w  =  0.3,  wn  =  0,  and 
ws  =  1.  Adjusting  wp  will  change  the  total  flip  number.  When  wp  =  0,  the  energy 
reaches  the  minimum  with  the  initial  labels,  and  the  total  flip  number  is  0.  When 
wp  increases,  more  emphasis  is  placed  on  contextual  information,  and  the  flip  number 
increases.  When  wp  — >  +oo,  it  converges  to  the  case  of  wp  =  1  and  ws  =  0,  the  setting 
of  the  first  run.  The  maximal  overall  classification  accuracy  is  achieved  when  wp  =  6. 
Compared  with  the  first  run,  the  total  number  of  corrected  blocks  increases  from  389  to 
424  by  incorporating  classification  confidence.  Similar  results  are  achieved  by  combining 
classification  confidence  with  clique  Cn. 
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Table  8:  Word  block  classification  after  MRF  based  post-processing. 


#  of  blocks 

#  of 
correctly 
classified 
blocks 

#  of 

misclassified 

blocks 

Reduction  of 
misclassified 
blocks 

Error 

reduction 

rate 

Accuracy 

Precision 

Printed  text 

19,227 

18,835 

392 

389 

49.8% 

98.0% 

99.7% 

Handwriting 

701 

652 

49 

-1 

-2.1% 

93.0% 

83.3% 

Noise 

8,802 

8,682 

120 

160 

57.1% 

98.6% 

96.0% 

Total 

28,730 

28,169 

561 

548 

49.4% 

98.1% 

N/A 

In  the  last  run,  we  fix  ws  =  1  and  manually  adjust  w,  wp,  and  wn  to  optimize  the 
overall  classification  accuracy.  The  final  parameters  we  chose  are  w  =  0.39,  wp  =  5,  and 
wn  =  4.  Table  8  shows  the  results  after  post-processing.  The  “Error  Reduction  Rate”  in 
Table  8  is  defined  as  follows: 


Error 

Reduction 

Rate 


#  of  Errors  Before  Post-Processing  —  #  of  Errors  After  Post-Processing 
#  of  Error  Before  Post-Processing 


(72) 

The  error  rate  reduces  to  about  half  of  the  original  for  both  machine  printed  text 
and  noise,  but  increases  slightly  for  handwriting.  However,  compared  with  Table  7,  the 
precision  of  handwriting  increases  from  62.9%  to  83.3%  due  to  fewer  machine  printed 
text  and  noise  mis-classifications  as  handwriting.  The  overall  accuracy  increases  from 


96.1%  to  98.1%. 


Fig.  23  shows  an  example  of  machine  printed  text  and  handwriting  identification 
from  noisy  documents.  To  display  the  classification  results  clearly,  we  decompose  the 
classified  image  into  three  layers,  representing  machine  printed  text  (Fig.  23b),  hand¬ 
writing  (Fig.  23c),  and  noise  (Fig.  23d)  respectively.  The  result  is  good  with  few  mis- 
classifications. 

Our  approach  is  general  and  can  be  extended  to  other  languages  with  minor  modi¬ 
fication.  Fig.  24  shows  identification  results  for  a  Chinese  document.  In  Chinese,  there 
is  no  clear  definition  of  words  and  no  spaces  between  neighboring  words.  Therefore,  the 
parameters  of  our  word  segmentation  module  are  adjusted  to  get  characters.  We  only 
need  to  retrain  the  classifiers;  the  post-processing  module  is  intact.  We  can  see  that  most 
handwriting  and  noise  blocks  are  classified  correctly,  but  several  machine  printed  digits 
are  misclassified  as  handwriting.  On  the  right  margin  of  the  document,  some  machine 
printed  text  is  identified  as  noise  due  to  touching. 

Our  approach  is  fast;  the  averaging  processing  time  for  a  business  letter  scanned  at 
300  dpi  is  about  2-3  seconds  on  a  PC  with  1.7  GHz  CPU  and  1.0  GB  memory. 


4.6  Application  to  Zone  Segmentation  in  Noisy  Images 

The  proposed  method  is  not  limited  to  extract  handwriting  from  a  heterogeneous  docu¬ 
ment.  After  classification,  we  can  output  different  contents  into  different  layers.  By  sep¬ 
arating  noise,  the  layer  of  machine  printed  text  becomes  cleaner  than  the  original  noisy 
document.  Therefore,  our  approach  can  be  used  as  a  document  enhancement  procedure, 
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Figure  23:  An  example  of  machine  printed  text  and  handwriting  identification  from 
noisy  documents,  (a)  The  original  document  image,  (b)  Machine  printed  text,  (c) 
Handwriting,  (d)  Noise.  The  logo  is  classified  as  noise  since  currently  we  only  consider 
three  classes. 
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Figure  24:  An  example  of  machine  printed  text  and  handwriting  identification  from 
Chinese  documents,  (a)  Original  Chinese  document  image,  (b)  Machine  printed  text, 
(c)  Handwriting,  (d)  Noise. 
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which  facilitates  further  document  image  analysis  tasks,  such  as  zone  segmentation  and 
OCR, 

In  this  section,  we  show  that  our  method  can  improve  general  zone  segmentation 
results  after  removing  identified  noise.  We  evaluated  two  widely  used  zone  segmentation 
algorithms:  the  Docstrum  algorithm  [16]  and  ScanSoft  SDK,  a  commercial  OCR  software 
package  [100].  Many  different  zone  segmentation  evaluation  metrics  have  been  proposed 
in  previous  work.  Kanai  et  al.  [101]  evaluated  zone  segmentation  accuracy  from  the 
OCR  aspect.  Any  zone  splitting  and  merging,  if  it  does  not  affect  the  reading  order  of 
the  text,  is  not  penalized.  The  approach  of  Mao  and  Kanungo  [102]  is  based  on  text 
lines,  which  penalizes  only  horizontal  text  line  splitting  and  merging,  since  it  will  change 
the  reading  order  of  text.  Randriamasy  et  al.  [103]  proposed  an  evaluation  method  based 
on  multiple  ground  truth,  which  is  very  expensive.  Liang’s  approach  is  performed  at  the 
zone  level  [75].  After  finding  the  correspondence  between  the  segmented  and  ground- 
truthed  zones,  any  large  enough  difference  is  penalized.  We  use  Liang’s  scheme  in  our 
experiment  since  we  focus  on  zone  segmentation.  From  the  OCR  perspective,  vertical 
splitting  or  merging  of  different  zones  should  not  be  penalized  even  when  these  zones  have 
different  physical  and  semantic  properties.  From  the  point  of  view  for  zone  segmentation, 
it  should  be  penalized. 

There  are  1,374  machine  printed  text  zones  in  94  noisy  document  images.  The  ex¬ 
perimental  results  are  shown  in  Table  9.  All  merging  and  splitting  errors  are  counted  as 
partially  correct  in  the  table.  Before  noise  removal,  ScanSoft  has  very  poor  results,  an 
accuracy  of  15.9%,  on  noisy  documents  under  this  metric.  After  analyzing  the  segmen¬ 
tation  results,  we  found  that  ScanSoft  tends  to  merge  horizontally  arranged  zones  into 
one  zone,  which  is  suitable  for  documents  with  simple  layouts  such  as  technical  articles, 
but  not  for  other  document  types  such  as  business  letters.  The  Docstrum  algorithm  out¬ 
puts  many  more  zones  than  ScanSoft,  resulting  in  a  higher  accuracy  (53.0%),  but  also  a 
higher  false  alarm  rate  (114.1%).  After  noise  removal,  the  accuracy  of  both  algorithms 
increases  significantly,  from  15.9%  to  48.4%  for  ScanSoft  and  from  53.0%  to  78.0%  for  the 
Docstrum  algorithm.  The  false  alarm  rate  is  reduced  from  32.5%  to  1.3%  for  ScanSoft 
and  from  114.1%  to  7.9%  for  Docstrum. 

Fig.  25  shows  the  zone  segmentation  results  for  two  noisy  documents  with  the  Doc¬ 
strum  algorithm  before  and  after  noise  removal.  The  handwriting  is  output  to  another 
layer,  not  shown  here.  After  noise  removal,  we  see  fewer  splitting  and  merging  errors, 
and  overall  the  segmentation  results  have  significantly  improved. 

4.7  Summary  and  Future  Work 

In  this  chapter,  we  have  presented  an  approach  to  segmenting  and  identifying  handwrit¬ 
ing  from  machine  printed  text  in  extremely  noisy  document  images.  Instead  of  using 
simple  filtering  rules,  we  treat  noise  as  a  distinct  class.  We  use  statistical  classification 
techniques  to  classify  each  block  into  machine  printed  text,  handwriting,  and  noise.  We 
then  use  Markov  random  fields  to  incorporate  contextual  information  for  post-processing. 
Experiments  show  that  MRFs  are  an  effective  tool  for  modeling  local  dependency  among 
neighboring  image  components.  After  post-processing,  the  classification  error  rate  is 
reduced  by  approximately  50%. 
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(d) 


Figure  25:  Zone  segmentation  before  and  after  noise  removal  using  the  Docstrum  algo¬ 
rithm.  (a)  and  (c)  show  the  results  before  noise  removal,  (b)  and  (d)  are  the  results  after 
noise  removal. 
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Table  9:  Machine  printed  zone  segmentation  experimental  results  on  94  noisy  document 
images  (totally  1,374  zones),  before  and  after  noise  removal. 


Before  noise  removal 

After  noise  removal  | 

Correctly 

segmented 

zones 

False 

alarm 

zones 

Partially 

correctly 

segmented 

zones 

Missed 

zones 

Correctly 

segmented 

zones 

False 

alarm 

zones 

Partially 

correctly 

segmented 

zones 

Missed 

zones 

ScanSoft 

219 

(15.9%) 

446 

(32.5%) 

1,148 

(83.7%) 

7 

(0.5%) 

665 

(48.4%) 

18 

(1.3%) 

671 

(48.8%) 

38 

(2.8%) 

Docstrum 

728 

(53.0%) 

1,568 

(114.1%) 

646 

(47.0%) 

0 

(0.0%) 

1,071 

(78.0%) 

109 

(7.9%) 

270 

(19.7%) 

(2.4%) 

Our  method  is  general  enough  to  be  extended  to  documents  in  some  other  scripts, 
such  as  Chinese,  by  re-training  the  classifier,  ffowever,  our  approach  does  not  apply 
in  a  straightforward  manner  to  cursive  scripts  such  as  Arabic.  Two  observations  used 
to  discriminate  handwriting  from  machine  printed  text  for  English  documents  do  not 
hold  for  Arabic  documents.  (1)  Handwriting  is  more  cursive  than  machine  printed  text 
in  English  documents.  However,  machine  printed  Arabic  text  is  cursive  in  nature.  (2) 
People  like  to  connect  several  characters  during  writing.  However,  many  machine  printed 
Arabic  characters  are  also  connected.  Preliminary  experiments  using  the  same  feature  set 
proposed  in  this  chapter  were  performed  for  Arabic  documents,  resulting  in  a  low  classi¬ 
fication  accuracy  at  word  level.  New  features  should  be  designed  for  Arabic  documents. 
It  should  be  much  easier  to  distinguish  handwriting  from  machine  printed  text  in  Arabic 
at  the  text  line  level.  However,  how  to  reliably  extract  text  line  from  a  heterogeneous 
and  noisy  document  is  challenging  problem  in  itself. 
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5  Handwriting  Matching 
5.1  Introduction 

Handwriting  samples  (the  same  content)  are  often  produced  with  large  deformation  by 
different  persons  or  the  same  person  at  different  times.3  Due  to  the  difficulty  in  re¬ 
producing  a  handwriting  sample  exactly,  handwriting  is  often  used  to  identify  a  per¬ 
son.  On  the  other  hand,  large  deformation  is  a  challenge  for  other  handwriting  related 
applications  such  as  handwriting  recognition  and  retrieval.  To  study  the  deformation 
characteristics,  it  is  crucial  to  automatically  establish  the  correspondence  between  two 
handwriting  samples,  which  have  the  following  applications: 

1.  It  is  much  easier  to  define  similarity  measures  between  handwriting  samples  after 
establishing  the  correspondence.  The  measures  are  often  more  robust  and  have 
more  distinguishing  power,  comparing  to  the  case  where  we  do  not  know  the  cor¬ 
respondence  [24,104,105]. 

2.  The  deformation  characteristics  learned  can  be  used  to  synthesize  visually  realistic 
handwriting  samples  [106].  The  quality  of  a  trained  statistical  model  for  pattern 
classification  depends  highly  on  the  quality  of  the  training  set.  As  a  rule,  the  more 
samples  used  for  training,  the  higher  the  generalization  capability  of  the  trained 
classifier.  Since  collecting  a  large  volume  of  handwriting  samples  is  generally  very 
expensive,  synthesized  samples  are  often  used  to  enlarge  the  training  set  [107,108]. 

In  this  chapter,  we  study  the  handwriting  matching  problem.  In  the  next  two  chapters, 
we  will  apply  our  matching  algorithm  to  handwriting  synthesis  and  retrieval,  respectively. 

We  study  handwriting  matching  in  a  broader  context  of  shape  matching,  which  is 
often  encountered  in  image  analysis,  computer  vision,  and  pattern  recognition.  A  shape 
may  be  represented  as  a  set  of  features  at  different  levels,  such  as  points,  line  segments, 
curves,  or  surfaces.  Shape  matching  may  be  performed  on  these  representations.  The 
survey  paper  by  Loncaric  [109]  covers  the  extraction  and  representation  of  a  shape. 
Different  distance  definitions  between  two  features  (i.e.,  point,  lines,  or  curves)  and  their 
use  in  shape  matching  can  be  found  in  [110].  In  general,  the  higher  the  level  of  a  feature, 
the  more  difficult  it  is  to  extract  the  feature  reliably.  The  extraction  of  interesting 
points,  for  example,  is  easy  (sometimes  trivial),  and  it  is  more  general  since  lines  and 
surfaces  can  be  discretized  as  a  set  of  points.  Although  such  discretization  is  by  no 
means  optimal,  reasonable  matching  results  may  be  achieved  in  many  cases  [111].  Point 
matching,  therefore,  is  often  used  in  applications  such  as  pose  estimation  [112,  113], 
medical  image  registration  [114],  surface  registration  [115,116],  object  recognition  [24], 
and  handwriting  recognition  [104,105]. 

In  our  approach,  we  first  extract  the  handwriting  skeleton,  then  uniformly  sample  a 
set  of  points  from  the  skeleton.  After  that,  we  develop  an  algorithm  to  estimate  the  cor¬ 
respondence  between  two  point  sets.  Compared  to  the  original  pixel-based  representation 
(which  can  be  seen  as  a  representation  with  dense  points),  our  approach  demands  fewer 

3In  this  chapter,  we  exclude  the  variation  produced  by  different  contents  (e.g.,  the  difference  between 
handwriting  of  letters  ’a’  and  ’b’)-  Throughout  this  chapter,  we  focus  on  the  deformation  in  shape. 
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points.  Several  hundred  points  are  enough  to  represent  the  structure  of  handwriting.  An¬ 
other  advantage  is  our  representation  is  more  robust  to  stroke  width  variation,  which  is 
often  introduced  by  the  use  of  different  writing  tools  or  different  digitization  parameters 
(e.g.,  different  parameters  for  scanning  and  binarization) .  Our  point  matching  algorithm 
uses  no  or  little  prior  knowledge  of  handwriting,  and  is  general  enough  to  be  applied  to 
other  point  pattern  based  nonrigid  shape  matching  problems.  In  the  remaining  of  this 
chapter,  we  describe  our  algorithm  in  the  broader  context  of  nonrigid  shape  matching , 
instead  of  the  narrower  handwriting  matching. 

5.1.1  Overview  of  Our  Approach 

Although  the  absolute  distance  between  two  points  may  change  significantly  under  non- 
rigid  deformation,  the  neighborhood  structure  of  a  point  is  generally  well-preserved  due 
to  physical  constraints.  For  example,  a  human  face  is  a  nonrigid  shape,  but  the  relative 
position  of  chin,  nose,  mouth,  and  eyes  cannot  deform  independently  due  to  underlying 
constraints  of  bones  and  muscles.  These  physical  constraints  restrict  the  deformation  of 
the  point  set  sampled  from  a  face.  The  rough  structure  of  a  shape  is  typically  preserved, 
otherwise  even  people  cannot  match  shapes  reliably  under  arbitrary  large  deformation. 
Such  constraints  may  be  represented  as  the  ordering  of  points  on  a  curve.  Sebastian  et 
al.  [117]  demonstrated  the  effectiveness  of  point  ordering  in  matching  curves,  but  for  gen¬ 
eral  shapes  other  than  curves,  local  point  ordering  is  difficult  to  describe,  and  is  ignored 
in  many  point  matching  algorithms  [24],  As  a  major  contribution,  we  formulate  point 
matching  as  an  optimization  problem  to  preserve  local  neighborhood  structures.  In  ad¬ 
dition  to  the  physical  constraint  explanation,  our  approach  is  supported  from  cognitive 
experiments  of  human  shape  perception.  Strong  evidence  suggests  the  early  stages  of 
human  visual  processing  is  local,  parallel,  and  bottom-up,  though  feedback  may  be  nec¬ 
essary  in  later  stages.  Preserving  local  neighborhood  structures  is  important  for  people 
to  detect  and  recognize  shapes  efficiently  and  reliably  [118,119]. 

In  our  approach,  we  formulate  point  matching  as  an  optimization  problem  to  preserve 
local  neighborhood  structures  during  matching.  Our  formulation  has  a  simple  graph 
matching  interpretation,  where  each  point  is  a  node  in  the  graph  and  two  nodes  are 
connected  by  an  edge  if  they  are  neighbors.  The  optimal  match  between  two  graphs  is 
the  one  that  maximizes  the  number  of  matched  edges  (i.e.,  the  number  of  neighborhood 
relations).  Graph  matching  is  an  NP-hard  problem.  Exhaustive  or  branch-and-bound 
search  for  a  global  optimal  solution  is  only  realistic  for  graphs  with  few  nodes.  As 
an  alternative,  a  discrete  optimization  problem  can  be  converted  to  a  continuous  one, 
allowing  several  continuous  optimization  techniques  to  be  applied  [120,  121],  In  our 
approach,  we  use  the  shape  context  distance  to  initialize  graph  matching,  followed  by  a 
relaxation  labeling  process  to  refine  the  match. 

5.1.2  Previous  Work 

Shapes  can  be  roughly  categorized  as  rigid  or  nonrigid,  and  the  realization  of  a  shape 
may  undergo  various  deformations  in  captured  images.  With  a  small  number  of  transfor¬ 
mation  parameters  (six  for  a  2-D  affine  transformation),  rigid  shape  matching  under  the 
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affine  [111,115]  or  projective  transformation  [113]  is  relatively  easy  and  has  been  widely 
studied  with  an  extensive  literature.  Since  it  is  impossible  to  sufficiently  discuss  previous 
publications  without  omitting  many  excellent  works,  we  refer  the  reader  to  other  survey 
papers  for  a  large  bibliography  [122,123].  Although  many  point  matching  algorithms 
developed  for  rigid  shapes  can  tolerate  some  degree  of  noise  or  local  distortions,  large 
free-form  deformation  presents  a  significant  challenge.  Recently,  point  matching  for  non- 
rigid  shapes  has  received  more  and  more  attention.  In  the  following  literature  review,  we 
will  focus  on  publications  on  nonrigid  shape  matching. 

Point  matching  for  nonrigid  shapes  is  problematic  because  the  method  must  com¬ 
pensate  for  both  linear  distortions  (i.e.,  translation,  rotation,  scale  changes,  and  shear) 
and  non-linear  distortions.  Therefore,  the  common  framework  of  iterated  correspondence 
and  transformation  estimation  is  used  widely.  The  iterated  closest  point  (ICP)  algorithm 
is  a  well-known  heuristic  approach  proposed  by  Besl  and  McKay  [111,115].  Assuming 
two  shapes  are  roughly  aligned,  for  each  point  in  one  shape,  the  closest  point  in  the 
other  shape  is  taken  as  the  current  estimate  of  the  correspondence.  The  affine  trans¬ 
formation  estimated  with  the  current  correspondence  will  then  bring  two  shapes  closer. 
ICP  was  later  extended  for  nonrigid  free-form  surfaces  [116].  The  framework  consists  of 
three  stages.  First,  the  rigid  displacement  is  estimated  using  surface  curvatures.  Sec¬ 
ond,  the  global  affine  transformation  is  estimated  using  the  ICP  algorithm.  Third,  a 
local  affine  transformation  (LAT)  is  attached  to  each  point  to  deform  the  surface  locally. 
Wakahara  [104]  used  LAT  to  match  and  recognize  handwritten  characters.  A  dynamic 
window  with  a  gradually  decreasing  size  is  used  to  estimate  the  local  affine  transforma¬ 
tion  for  a  point.  This  approach  was  later  improved  by  combining  global  and  local  affine 
transformations  to  increase  the  robustness  [105]. 

Although  LAT  has  enough  flexibility  to  model  local  nonrigid  deformation,  no  stan¬ 
dard  exists  to  define  the  neighborhood  window  size  to  estimate  the  parameters  of  LAT. 
How  to  combine  the  global  and  local  affine  transformations  is  an  open  problem  as  well, 
so  more  flexible  deformation  models  with  closed-form  representations  are  desired.  In  the 
literature  on  interpolation  and  approximation,  radial  basis  functions  (RBF)  with  different 
kernel  functions,  such  as  the  thin  plate  spline  (TPS)  [124]  and  the  Gaussian  RBF  [125], 
are  widely  used.  Recently,  the  TPS  deformation  model  began  to  be  applied  in  point 
matching  [24, 25]  because  it  can  be  formulated  as  an  optimal  solution  of  the  bending  of 
a  thin  plate  [124],  Chui  and  Rangarajan  [25]  proposed  an  optimization  based  approach, 
the  TPS-RPM  algorithm.  The  TPS  model’s  bending  energy  and  the  average  Euclidean 
distance  between  two  point  sets  are  combined  in  an  objective  function.  The  soft  assign¬ 
ment  technique  and  deterministic  annealing  algorithm  are  used  to  search  for  the  optimal 
solution,  which  significantly  outperforms  the  ICP  algorithm  on  nonrigid  point  match¬ 
ing.  Belongie  et  al.  [24]  proposed  another  method  for  nonrigid  point  matching.  In  this 
approach,  a  shape  context  is  assigned  to  a  given  point,  which  describes  the  relative  distri¬ 
bution  of  the  other  points  in  the  shape.  After  defining  the  similarity  between  two  points 
based  on  their  shape  contexts,  the  Hungarian  algorithm  [126]  searches  for  the  optimal 
match  between  the  two  point  sets.  Similarly,  the  TPS  model  brings  two  shapes  closer  in 
each  iteration. 

More  recently,  Glaunes  et  al.  [127]  proposed  another  point  matching  approach.  Taking 
a  point  set  as  a  sampling  of  the  underlining  distribution,  they  proposed  a  theory  based  on 
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diffeomorphisms  on  distributions.  Their  formulation  reduces  to  an  optimization  problem 
with  a  weighted  summation  of  two  parts:  the  energy  associated  with  the  deformation 
and  the  distance  between  two  point  sets  under  this  deformation.  This  is  similar  to 
the  objective  function  in  [25],  although  no  explicit  deformation  model  is  assumed.  The 
variational  method  is  used  to  search  for  the  optimal  deformation.  Experimental  results 
on  synthesized  data  are  encouraging,  but  more  extensive  tests  should  be  performed  to 
show  the  effectiveness  of  their  approach. 

Another  interesting  work  is  matching  articulated  objects  [112].  An  articulated  object 
(such  as  a  person)  is  composed  of  several  rigid  segments  connected  by  pivot  points.  The 
deformation  of  rigid  segments  can  be  modeled  with  an  affine  transformation.  A  global 
hierarchical  search  strategy  searches  for  and  matches  pivot  points,  and  local  matching 
of  rigid  segments  is  used  to  prune  the  search  tree,  thus  reducing  the  computational 
cost  [112]. 

The  remainder  of  this  chapter  is  organized  as  follows.  In  Section  5.2,  we  formu¬ 
late  point  matching  as  an  optimization  problem.  Our  strategy  to  search  for  an  optimal 
solution  is  described  in  Section  5.3.  Shape  deformation  models,  such  as  the  affine  trans¬ 
formation  and  TPS,  are  discussed  in  Section  5.4,  followed  by  a  brief  summary  of  our 
approach  in  Section  5.5.  We  demonstrate  the  robustness  of  our  approach  with  experi¬ 
ments  in  Section  5.6,  and  the  chapter  concludes  with  a  discussion  of  the  future  work  in 
Section  5.7. 

5.2  Problem  Formulation 

In  this  section,  we  formulate  point  matching  as  an  optimization  problem.  Suppose  a 
template  shape  T  is  composed  of  M  points,  St  =  (T),  T2,  •  •  • ,  TM},  and  a  deformed 
shape  D  is  composed  of  N  points,  So  =  {D\,  D2,  ■  ■  ■ ,  D^}.  It  is  a  common  practice 
to  enforce  the  one-to-one  matching  constraint  in  point  matching,  so  the  point  sets  St 
and  So  are  augmented  to  S'T  =  {Tj,  T2,  •  •  • ,  Tm,  nil}  and  S'D  =  {D±,  D2,  •  •  • ,  -Djv,  nil} 
respectively,  by  introducing  a  dummy  or  nil  point.  A  match  between  shapes  T  and  D 
is  /  :  S'T  S'D,  where  the  matching  of  normal  points  is  one-to-one,  but  multiple  points 
may  be  matched  to  a  dummy  point. 

Under  a  rigid  transformation  (i.e.,  translation  and  rotation),  the  distance  between 
any  pair  of  points  is  preserved.  Therefore,  the  optimal  match  /  is 

/  =  arg  inin  C(T,D,f),  (73) 

where 

M  M 

C(T,D,f)  =  J2  J2  (HT”>  “  T‘ll  “  HB/(->  “  -D/dl)2 

771=1  i=  1 

N  N 

+  E  E  (»B»  -  DiW  ~  -  r/-‘0)ll)2  ■  (U) 

n=  1  j= 1 

In  this  cost  function,  we  penalize  any  matching  error  which  does  not  preserve  the  distance 
of  a  point  pair.  If  M  =  N  and  no  points  are  matched  to  dummy  points,  the  Erst  term 
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and  the  second  term  in  (74)  should  be  equal,  and  the  optimal  match  should  achieve  zero 
penalty,  C(T,  D,  f)  =  0.  Points  matching  a  dummy  point  need  special  treatment,  and  to 
simplify  the  representation,  we  do  not  describe  such  treatment  here.  We  will  return  to 
this  issue. 

If  nonrigid  deformation  is  present,  the  distance  between  a  pair  of  points  will  not  be 
preserved,  especially  for  points  which  are  far  apart.  On  the  other  hand,  due  to  physical 
constraints  and  in  order  to  preserve  the  rough  structure,  the  local  neighborhood  of  a 
point  may  not  change  freely.  We,  therefore,  define  a  neighborhood  for  point  i  as  A/).  The 
neighborhood  relationship  is  symmetric,  meaning  if  j  E  Mi  then  i  E  Mj.  Since  we  want 
to  preserve  the  distances  of  neighboring  point  pairs  under  deformation,  (74)  becomes 


C(T,D,f)  =  27  57  (nrm  -  Till  -  -  i>/0)ll)2 

m= 1  ieAfm 
N 

+  EE  (l|£>„-r>j||-||r/-.w-T/-.U)||)2.  (75) 

n=  1  jeATn 


The  absolute  distance  of  a  pair  of  points  is  not  preserved  well  under  scale  changes. 
Therefore,  we  quantize  the  distance  to  two  levels  as 


II Tm  -  Till 


0  m  E  Mi 

1  m  £  Mi 


and  1 1  Dn  —  J):j 


0  n  E  Mj 
1  n  ^  Mj 


(76) 


(75)  then  is  simplified  to 


M 


N 


c(t,dj)  =  77  E  + 


(77) 


m=  1  i€A/7 


n= 1  jeAfn 


where 


0  j  E  Mi 

i  j  i  Mi 


(78) 


To  deal  with  points  matched  to  a  dummy  point,  we  let  d(.,nil )  =  d(nil, .)  =  d(nil,  nil )  = 
1  to  discourage  the  match. 

In  the  following,  we  rewrite  the  objective  function  of  (77),  and  interpret  it  as  a  graph 
matching  problem.  First,  we  subtract  a  constant  term  from  C(T,  D,  f)- 


M  N 

C'(T,D,f )  =  C(T,  D,  f)  —  EE'-EE1 

m= 1  ieAfm  77—1  jeATn 

M  N 

=  /(*))  - !]  +  K/_1(n)’  /_1(i))  - 1] 

m=l  ieAIm  n=l  jeATn 

M  N 

=  /(*))  -  (79) 

m= 1  ieATm  n=l  jeATn 
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(b) 


Figure  26:  A  point  set  (a)  and  its  graph  representation  (b). 


where 

S(i,j)  =  l-d(i,j)  (80) 

Minimizing  C(T,D,f)  is  equivalent  to  minimizing  C'(T,  D,  f)  since  the  difference  be¬ 
tween  them  is  a  constant.  Therefore,  the  minimization  problem  of  (73)  is  equivalent  to 
the  following  maximization  problem. 

/  =  arg  max  S(T,D,  f )  (81) 


where 

M  N 

S(T,D,f)  =  J2  Ef(/W,/P) (82) 

m=li£j\fm  n=l  j£j\fn 

This  formulation  has  a  simple  graph  matching  interpretation.  Each  point  is  a  node  in  the 
graph,  and  two  nodes  are  connected  by  an  edge  if  they  are  neighbors.  The  dummy  node 
is  not  connected  to  other  nodes  in  the  graph.  If  connected  nodes  m  and  i  in  one  graph 
are  matched  to  connected  nodes  f(m)  and  f(i)  in  the  other  graph,  ))  =  1. 

The  optimal  solution  of  (81)  is  the  one  that  maximizes  the  number  of  matched  edges  of 
two  graphs. 

No  obvious  neighborhood  definition  exists  for  a  point  set.  In  the  following  we  present 
a  simple  neighborhood  definition.4  Initially,  the  graph  is  fully  connected,  then  we  remove 
long  edges  until  a  pre-defined  number  of  edges  are  preserved.  Supposing  M  nodes  in  the 
graph  and  on  average  each  point  has  Eave  neighbors,  then  the  number  of  preserved  edges 
is  M  x  Eave/ 2  ( Eave  =  5  in  default).  With  this  neighborhood  definition,  the  graph 
representation  of  a  point  set  is  translation,  rotation,  and  scale  invariant.  Fig.  26  shows  a 
point  set  and  its  graph  representation.  We  expect  points  connected  with  an  edge  move 
together  under  deformation,  so  the  structure  of  the  graph  is  preserved. 

Graph  matching  (more  generally,  attributed  relational  graph  matching)  is  used  in  [128] 
and  [129]  to  match  road  maps  extracted  from  aerial  photographs.  Our  graph  definition 
differs  from  theirs,  where  road  intersections  are  nodes  in  the  graph  and  two  nodes  are 
connected  by  an  edge  if  a  road  appears  between  two  intersections.  Such  a  graph  definition 

4Our  framework  is  general  enough  to  incorporate  other  neighborhood  definitions.  Please  refer  to 
Section  5.6  for  a  definition,  which  is  robust  under  nonuniform  scale  changes  for  different  parts  of  a 
shape. 


67 


is  natural  for  a  road  map,  but  errors  in  road  detection  will  change  the  graph  structure.  In 
our  case,  given  an  arbitrary  set  of  points,  there  is  no  such  natural  definition  of  connections 
among  points.  Graph  matching  is  widely  used  in  many  fields  such  as  computer  vision 
and  pattern  recognition.  There  are  various  kinds  of  graph  structures,  and  many  different 
metrics  are  available  in  the  literature  to  evaluate  a  match  between  two  graphs  [130]. 
Our  graph  representation  and  the  corresponding  matching  metric  are  derived  from  the 
observation  (or  assumption)  that  nonrigid  deformation  will  not  change  the  neighborhood 
of  a  point  significantly. 

5.2.1  Matching  Matrix 

We  can  represent  the  matching  function  /  in  (82)  with  a  set  of  supplemental  variables, 
which  can  be  organized  as  a  matrix  P  with  dimension  (M  +  1)  x  (. N  +  1). 


If  point  Tm  in  the  template  shape  T  is  matched  to  point  Dn  in  the  deformed  shape  D, 
then  Pmn  =  1;  otherwise  Pmn  =  0.  The  last  row  and  column  of  P  represent  the  case  that  a 
point  may  be  matched  to  a  dummy  point.  Matrix  P  satisfies  the  following  normalization 
conditions 

JV+l 

y  Pmn  =  1  for  m  =  1,2,  •  •  •  ,M,  (84) 

n= 1 
Af+1 

y  Pmn  =  1  for  n  =  1,2,---,  2V.  (85) 

771=  1 

Using  matrix  P,  the  objective  function  (82)  can  be  written  as 

M  N 

S(T,  D,  P)  =  2  J2  E  E  E  p™pv  <86> 

m= 1  ieJ^m  n=l  jeAfn 

5.3  Searching  for  an  Optimal  Solution 

Since  Pmn  G  {0, 1},  searching  for  an  optimal  P  that  maximizes  S(T,  D,  P)  is  a  difficult 
discrete  combinatorial  problem.  In  our  approach,  we  use  relaxation  labeling  to  solve  the 
optimization  problem,  where  the  condition  Pmn  G  {0, 1}  is  relaxed  as  Pmn  G  [0, 1]  [121]. 
After  relaxation,  Pmn  is  a  real  number,  and  the  problem  is  converted  to  a  constrained 
optimization  problem  with  continuous  variables. 

5.3.1  Matching  Initialization 

The  performance  of  relaxation  labeling  depends  heavily  on  the  initial  value  of  the  match¬ 
ing  probability  matrix  P.  We  need  a  good  initial  measure  of  the  matching  probabilities. 
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Figure  27:  Shape  context  of  a  point,  (a)  Basic  shape  context,  (b)  Our  rotation  invariant 
shape  context.  The  point  labeled  with  *  is  the  mass  center  of  the  point  set. 

One  option  involves  assigning  an  attribute,  such  as  the  color  or  intensity  gradients  of  the 
pixel,  to  a  point  if  it  is  extracted  as  a  pixel  in  an  image  [131].  We  can  then  compute 
the  similarity  between  a  pair  of  points,  and  convert  it  to  a  measure  of  the  matching 
probability.  If  a  set  of  points  is  given  without  any  additional  information,  the  shape 
context  provides  an  effective  way  to  compute  the  similarity  between  two  points  [24],  In 
our  approach,  we  use  the  shape  context  distance  to  initialize  the  point  matching  proba¬ 
bilities.  If  other  attributes  of  a  point  are  available,  they  can  be  easily  incorporated  into 
our  framework. 

To  extract  the  shape  context  of  a  point,  an  array  of  bins  is  placed  around  the  point, 
as  shown  in  Fig.  27a.  The  number  of  points  inside  each  bin  is  calculated  as  the  context 
of  this  point.  Therefore,  the  shape  context  of  a  point  is  a  measure  of  the  distribution 
of  other  points  relative  to  it.  Bins  that  are  uniform  in  log-polar  space  are  used  to  make 
the  descriptor  more  sensitive  to  positions  of  nearby  points  than  to  those  of  points  far 
away.  Five  bins  for  the  radius  and  12  bins  for  the  rotation  angle  are  used  throughout 
our  experiments.  Consider  two  points,  m  in  one  shape,  and  n  in  the  other  shape.  Their 
shape  contexts  are  hm{k )  and  hn{k ),  for  k  =  1,2,  respectively.  Let  Cmn  denote 

the  cost  of  matching  these  two  points.  As  shape  contexts  are  distributions  represented 
as  histograms,  it  is  natural  to  use  the  y2  test  statistic  [24] 


[hm(k)  -  hn(k)f 
h"m  (. k )  +  hn(k) 


The  Gibbs  distribution  is  widely  used  in  statistical  physics  and  image  analysis  to  relate 
the  energy  of  a  state  to  its  probability  [68].  Taking  the  cost  Cmn  as  the  energy  of  the 
state  that  points  m  and  n  are  matched,  the  probability  of  the  match  is 


Parameter  Tinit  is  used  to  adjust  the  reliability  of  the  initial  probability  measures,  where 
Timt  G  [0.05,0.1]  is  appropriate  according  to  our  experiments.  We  set  the  probability 
for  a  point  matching  to  a  dummy  point,  Prn,mi  or  Pnutn,  to  0.2.  Experiments  show  that 
our  approach  is  not  sensitive  to  this  parameter.  Fig.  28a  shows  the  initial  matching 
probability  matrix  P  of  two  shapes. 

Obviously,  shape  context  is  translation  invariant.  Using  bin  arrays  with  an  adaptive 
size  according  to  the  mean  point  distance  of  a  shape,  the  shape  context  is  scale  invariant 
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Figure  28:  Point  matching  probability  matrix  P.  The  matching  probabilities  to  a  dummy 
point  are  not  shown.  White  represents  a  high  probability,  (a)  Initial  probabilities  using 
the  shape  context  distance,  (b)  After  300  iterations  of  relaxation  labeling  updates. 


too  [24],  but  it  is  sensitive  to  large  rotations.  In  some  applications,  rotation  invariance  is 
required.  Our  graph  representation  is  rotation  invariant,  so  we  need  a  rotation  invariant 
initialization  scheme.  A  complete  rotation  invariant  shape  context  was  proposed  using  the 
tangent  direction  at  each  point  as  the  positive  x-axis  for  the  local  coordinate  system  [24], 
One  drawback  of  this  approach  is  that  the  tangent  direction,  defined  for  gray-scale  images, 
does  not  apply  for  binary  images.  Furthermore,  if  only  the  point  set  is  given  without 
accessing  the  original  image,  we  cannot  estimate  the  tangent  direction.  Another  drawback 
is  that  as  a  first-order  derivative  operation,  the  estimate  of  the  tangent  direction  is 
sensitive  to  noise.  Instead,  in  our  approach,  we  use  the  mass  center  of  a  point  set  as  a 
reference  point  and  use  the  direction  from  a  point  to  the  mass  center  as  the  positive  ;r-axis 
for  the  local  coordinate  system.  Fig.  27b  shows  our  rotation  invariant  shape  context.  If 
there  is  zero  mean  white  noise  in  point  position  measurements,  after  averaging,  the  effect 
of  noise  to  the  mass  center  is  reduced.  Therefore,  our  approach  is  more  robust  than  the 
tangent  direction  based  approach  under  noise. 

5.3.2  Relaxation  Labeling 

Relaxation  labeling  was  first  proposed  in  a  seminal  paper  by  Rosenfeld  et  al.  in  the 
mid-1970s  [121].  The  basic  idea  is  to  use  iterated  local  context  updates  to  achieve  a 
globally  consistent  result.  The  contextual  constraints  are  expressed  in  the  compatibility 
function  Rmn(i,j ),  which,  in  our  case,  measures  the  strength  of  compatibility  between 
Trn  matching  Dn  and  7)  matching  Dj.  The  support  function  Smn  measures  the  overall 
support  the  match  between  points  Tm  and  Dn  receives  from  its  neighbors. 

Smn  —  E  E  Rynnij'  •>  j)Pij  (89) 
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The  original  updating  rule  is5 


p  — 

1  m.n.  •  — 


P m,n  Sn 


mn^mn 


Ylj= 1  PmjS, 


mj 


(90) 


The  denominator  enforces  the  normalization  constraint.  Traditionally,  only  the  one-way 
normalization  constraint,  Eq.  (84),  is  enforced  in  relaxation  labeling. 

In  the  original  paper,  Smn  is  defined  heuristically.  Although  with  ad  hoc  heuristic 
arguments,  a  variety  papers  later  reported  on  the  practical  usefulness  of  the  algorithm 
(see  [132]  for  a  review  and  an  extensive  bibliography).  The  success  in  real  applications 
and  the  heuristic  flavor  of  the  algorithm  motivated  investigators  to  establish  a  theoretic 
foundation.  There  are  two  approaches.  Some  have  tried  to  set  the  labeling  problem 
within  a  probabilistic  framework  using  Bayesian  analysis  [128,133].  The  Bayesian  theory 
can  explain  only  one  iteration  of  the  relaxation  process.  An  alternative  explicitly  defines 
some  quantitative  measure  of  consistency  to  be  maximized,  then  formulates  the  labeling 
problem  as  one  of  optimization  [134,135].  Projected  gradient  methods  are  often  used  to 
optimize  the  objective  function.  In  these  theories,  the  support  function  Srnn  is  defined 
as  the  derivative  of  the  objective  function  with  respect  to  Pmn  [135].  The  updating  rule 
of  the  projected  gradient  methods  is 


P:=P  +  7Q(S)  (91) 

where  7  is  the  updating  step  and  S'  is  a  matrix  composed  of  elements  Smn.  Q(S )  is  a 
projection  operation  of  S  to  limit  the  range  of  Pmn  to  [0, 1]  and  enforce  normalization 
constraints.  In  the  case  of  boundary  points  (i.e.,  having  at  least  one  component  of  the 
probabilities  equal  to  zero  or  one),  the  projection  operation  is  much  more  complicated 
and  the  procedure  becomes  computationally  expensive.  Furthermore,  the  updating  step 
7  is  difficult  to  tune.  An  increase  in  the  objective  function  is  guaranteed  only  when 
infinitesimal  steps  are  taken,  and  searching  for  the  optimal  step  size  in  each  iteration  is 
computationally  expensive.  Recently,  Pelillo  [136]  showed  that  the  original  updating  rule 
in  (90)  does  converge  to  a  local  minimum  if  1)  the  objective  function  is  a  polynomial  with 
nonnegative  coefficients,  and  2)  Smn  is  defined  as  a  gradient  of  the  objective  function. 
The  advantages  of  this  updating  rule,  compared  with  the  projected  gradient  methods,  are 
1)  computationally  expensive  projection  operations  are  avoided,  and  2)  it  is  parameter 
free.  We  tried  several  updating  rules  compared  in  [137]  and  found  that  the  updating 
formula  of  (90)  is  robust  and  achieves  better  results.  With  our  objective  function  of  (86), 
Smn  takes  the  form  of 

Smn  =  4  X]  P*i  (92) 

i^Mrn  j(zJ\fn 

Since  Smn  >  0,  the  constraint  that  Pmn  G  [0, 1]  is  satisfied  after  normalization.  Inter¬ 
preted  in  the  relaxation  labeling  theory,  our  compatibility  function  is  R,mn(i,  j)  =  1  if  a 
pair  of  neighbors  Tm  and  T)  are  matched  to  a  pair  of  neighbors  Dn  and  Dj]  otherwise 
Rm.nihj)  =  0.  We  can  easily  show  that  our  objective  function  corresponds  to  the  average 

5 In  the  original  paper,  the  support  function  S  is  defined  in  a  heuristic  way  in  the  range  of  [—1,1]. 
In  order  to  satisfy  P  >  0  after  updating,  1  +  S  is  used  to  substitute  S  in  both  the  numerator  and 
denominator  in  the  updating  rule  [121]. 
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local  consistency  measure  of  the  Hummel-Zucker  consistency  theory  [135].  According  to 
the  consistency  theory,  each  update  step  will  increase  the  overall  consistency  of  the  sys¬ 
tem.  Each  unambiguous  consistent  solution  is  a  local  attractor.  Starting  from  a  nearby 
point,  the  relaxation  labeling  process  will  converge  to  it  [135,136].  Although  there  is  no 
guarantee  that  the  updating  process  will  converge  to  an  unambiguous  solution  starting 
from  an  arbitrary  initialization,  our  experiments  show  that  most  elements  of  matrix  P 
do  converge  to  zero  or  one. 

In  previous  applications  of  the  relaxation  labeling  technique,  a  many-to-one  match 
is  allowed  [138-142],  Only  a  one-way  normalization  constraint,  either  (84)  or  (85),  is 
enforced.  Unfortunately,  in  many  applications,  one-to-one  match  is  desired.  Projected 
gradient  methods  may  be  modified  to  enforce  one-to-one  match.  The  projection  opera¬ 
tion,  however,  is  computationally  expensive,  and  it  is  unclear  how  to  find  a  projection 
satisfying  two-way  normalization  constraints.  In  our  approach,  a  different  approach  based 
on  alternated  row  and  column  normalizations  of  the  matching  probability  matrix  P  is 
used  to  enforce  one-to-one  match  [25].  A  nonnegative  square  matrix  with  each  row  and 
column  summing  to  one  is  called  a  doubly  stochastic  matrix.  Sinkhorn  [143]  showed 
that  the  iterative  process  of  alternated  row  and  column  normalizations  will  convert  a 
matrix  with  positive  elements  to  a  doubly  stochastic  matrix.  The  conclusion  can  be  ex¬ 
tended  to  a  non-square  matrix  with  positive  elements.  We  call  a  matrix  where  each  row 
and  column  (except  the  last  row  and  column)  sums  one  a  generalized  doubly  stochastic 
matrix.  We  can  show  that  alternated  row  and  column  normalizations  (except  the  last 
row  and  column)  of  a  matrix  with  positive  elements  will  result  in  a  generalized  doubly 
stochastic  matrix.  Please  refer  to  the  Appendix  of  this  chapter  for  the  proof.  This  tech¬ 
nique  was  also  used  in  the  soft  assignment  point  matching  approach  without  proof  [25]. 
Though  relaxation  labeling  with  one-way  normalization  constraint  can  be  theoretically 
well  founded  [136],  it  is  not  clear  whether  the  updating  process  will  converge  to  a  local 
optimum  and  increase  the  consistency  after  imposing  the  one-to-one  matching  constraint. 
We  found  the  updating  process  still  converges  through  experiments,  but  we  cannot  prove 
it  theoretically. 

Fig.  28a  shows  the  initial  value  of  the  point  matching  probability  matrix  P  of  two 
shapes.  After  each  relaxation  labeling  update,  we  perform  alternated  row  and  column 
normalizations  to  matrix  P.  Generally,  a  few  rounds  will  bring  a  matrix  close  to  a 
generalized  doubly  stochastic  matrix.  After  300  iterations  of  relaxation  labeling  updates, 
the  ambiguity  of  matches  decreases.  As  shown  in  Fig.  28b,  most  elements  of  the  matrix 
converge  to  zero  or  one. 

After  relaxation  labeling  updates,  points  with  maximum  matching  probability  less 
than  Pmin  ( Pmin  —  0.95)  are  labeled  as  outliers  by  matching  them  to  dummy  points.  The 
matched  point  pairs  are  used  to  estimate  the  parameters  of  the  affine  or  TPS  deformation 
model,  and  the  estimated  parameters  are  used  to  transform  the  template  shape  to  bring 
it  closer  to  the  deformed  shape.  In  some  application  scenarios  (e.g.,  the  experiments 
in  Section  5.6.1),  we  may  want  to  fold  as  many  matches  as  possible.  Unfortunately, 
the  ratio  of  points  matched  to  dummy  points  by  the  relaxation  labeling  updates  cannot 
be  controlled  directly.  After  several  iterations  of  correspondence  and  transformation 
estimations,  two  point  sets  may  be  close  to  each  other.  Therefore,  in  the  last  round,  we 
ford  the  optimal  one-to-one  match  by  minimizing  the  summation  of  Euclidean  distances 


72 


from  the  transformed  template  shape  to  the  deformed  shape. 

M 

Dist  =  \\T™-Df{™)\\  (93) 

m=  1 

where  T, ^  is  a  point  from  the  template  shape  after  TPS  transformation.  The  optimal 
match  /  of  (93)  can  be  found  using  the  Hungarian  algorithm  [126].  We  emphasize  that 
the  above  process  is  necessary  only  to  disable  the  automatic  outlier  rejection  scheme  of 
the  algorithm  and  find  as  many  matches  as  possible  between  two  point  sets. 

5.3.3  Relationship  to  Previous  Work 

One  difference  from  the  previous  applications  of  relaxation  labeling  on  point  match¬ 
ing  [142]  is  that  we  use  it  to  solve  a  constrained  optimization  problem  so  the  relaxation 
labeling  process  is  guaranteed  to  converge  to  a  local  optimal  solution  [136].  In  the  previ¬ 
ous  work,  relaxation  labeling  is  used  in  an  ad  hoc  way  without  an  objective  function  to 
be  optimized,  so  no  clear  indication  exists  for  the  quality  of  the  solution.  Furthermore, 
unlike  previous  work,  we  can  enforce  one-to-one  matching  in  onr  approach  if  necessary. 

The  relaxation  labeling  method  used  in  our  approach  is  similar  to  the  well-known 
soft  assignment  technique  [25,120].  Both  convert  the  discrete  combinatorial  optimiza¬ 
tion  problem  to  one  with  continuous  variables  by  assigning  a  probability  measure  to  a 
match.  The  procedure  is  called  relaxation  or  soft  in  these  two  techniques  respectively. 
Generally,  deterministic  annealing  is  used  to  solve  the  soft  assignment  problem.  It  be¬ 
gins  with  a  high  temperature  where  the  matching  probabilities  are  uniform.  By  gradually 
decreasing  the  temperature,  the  matching  probabilities  will  converge  to  a  local  optimal 
solution  of  the  objective  function.  An  appropriate  choice  of  the  initial  temperature  and 
temperature  reduction  ratio  is  necessary  to  achieve  good  results  [25].  On  the  contrary, 
the  relaxation  labeling  based  approach  is  parameter  free.  Another  advantage  is  that  we 
can  easily  incorporate  a  meaningful  initialization  in  relaxation  labeling.  Since  the  distri¬ 
bution  of  local  optima  is  complex,  a  good  initialization  is  crucial  to  achieve  a  good  result. 
Unfortunately,  it  is  difficult  to  incorporate  an  initialization  method  into  the  deterministic 
annealing  framework.  It  is  also  a  drawback  of  another  continuous  optimization  technique 
for  graph  matching  proposed  by  Pelillo  [144] .  We  tested  the  soft  assignment  based  graph 
matching  method  [120]  and  found  the  results  were  worse  than  the  relaxation  labeling 
based  approach. 

5.4  Shape  Deformation  Models 

It  is  difficult  to  achieve  a  good  match  for  shapes  under  both  rigid  and  nonrigid  distortions 
with  a  single-step  approach.  The  strategy  of  iterated  point  correspondence  and  transfor¬ 
mation  estimations  is  widely  used  for  nonrigid  shape  matching.  In  our  approach,  for  the 
first  iteration,  the  affine  transformation  between  two  shapes  is  estimated  and  corrected. 
Instead  of  using  the  least  squares  (LS)  estimator  to  estimate  parameters  of  the  affine 
transformation  [24],  we  use  a  more  robust  least  median  squares  (LMS)  estimator.  In 
the  following  iterations,  the  thin  plate  spline  (TPS)  deformation  model  is  exploited  to 
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bring  two  shapes  closer.  Our  approach  is  similar  to  [24]  except  that  a  more  robust  LMS 
estimator  is  used  to  estimate  the  affine  transformation,  instead  of  the  LS  estimator. 


5.4.1  Affine  Transformation  Estimation  Based  on  LMS 

The  LS  estimator  is  widely  used  to  estimate  transformation  parameters.  Suppose  point 
(. Xi,yi )  is  matched  to  point  ( Ui,Vi ),  for  i  =  1,2 the  optimal  parameters  of  the 
affine  transformation  minimize  the  summation  of  squares  of  the  regression  errors. 

i,f=ar9mjn|:||(“:)—  a(*)-T 

where  A  is  a  2  x  2  matrix  representing  the  rotation  and  anisotropic  scale  changes,  and  T 
is  a  translation  vector.  One  advantage  of  the  LS  estimator  is  that  closed-form  solutions 
are  available  [145].  It  is,  however,  sensitive  to  outliers  in  matching  [146].  The  breakdown 
point  is  often  used  to  evaluate  robustness  of  an  estimator  under  outliers,  which  is  defined 
as  the  smallest  proportion  of  observations  that  must  be  replaced  by  arbitrary  values  in 
order  to  force  the  estimator  to  produce  values  arbitrarily  far  from  the  true  values  [147]. 
The  breakdown  point  of  the  LS  estimator  is  0%.  Furthermore,  it  is  difficult  to  detect 
outliers  based  on  the  regression  residual  errors  since  they  may  spread  over  all  of  the 
points  [146]. 

In  general,  the  results  of  the  first  iteration  of  point  matching  may  be  noisy  with 
many  errors,  so  a  more  robust  estimator  is  required.  Several  robust  regression  methods 
have  been  proposed  in  the  statistics  literature.  Among  them,  the  least  median  squares 
(LMS)  estimator  achieves  the  highest  possible  break  down  point,  50%  [146].  Instead  of 
minimizing  the  summation  of  squares  of  regression  errors,  the  LMS  estimator  minimizes 
the  median  of  the  regression  errors. 

A,  T  —  arg  min  median  /  ^  ^  —  A  ^  ^  —  T  for  i  —  1,  2,  •  •  • ,  n 


There  are  no  closed-form  solutions  for  (95).  Normally,  we  randomly  select  a  subset  with 
three  matched  pairs  (which  can  determine  an  affine  transformation)  and  calculate  the 
median  of  the  regression  errors  using  the  estimated  parameters.  Iterating  the  random 
selection  procedure,  an  optimal  solution  of  (95)  can  be  achieved.  Suppose  there  are  n 
matched  pairs  and  about  50%  of  them  are  wrong.  In  the  worse  case,  we  must  select  at 


least 


n 

3 


n/2 

3 


+ 1  different  subsets  to  ensure  at  least  one  subset  without  outliers 


is  selected.  This  is  too  pessimistic.  In  real  applications,  we  only  need  to  examine  a  small 
number  of  subsets.  After  examining  k  subsets,  the  probability  of  having  at  least  one 
[  f  n/2  \  f  n  \  1  k 

good  subset  is  1  —  1  —  (  /  I  (assuming  sampling  with  replacement).  For 


example,  let  n  =  200,  the  probability  of  getting  at  least  one  good  subset  in  50  random 
selections  is  99.8%.  The  LMS  estimator  can  be  used  to  estimate  the  affine  transforma¬ 


tion  without  knowing  the  correspondence  between  two  point  sets  [148].  Without  rough 
correspondence,  a  large  number  of  subsets  need  to  be  examined. 
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5.4.2  TPS  Deformation  Model 


The  TPS  model  is  often  used  for  representing  flexible  coordinate  transformations  because 
it  has  a  physical  explanation  and  closed-form  solutions  in  both  transformation  and  pa¬ 
rameter  estimation  [124],  It  has  been  used  in  nonrigid  shape  matching  in  [24]  and  [25]. 
Two  TPS  models  are  used  for  the  2-D  coordinate  transformation.  Suppose  point  (. x^yi ) 
is  matched  to  (rq,  i\)  for  i  =  1,2,  ■  ■  ■ ,  n,  let  z%  =  /(ay,  yi )  be  the  target  function  value  at 
location  ( ay,r/j ),  we  set  Zi  equal  to  tq  and  ay  in  turn  to  obtain  one  continuous  transfor¬ 
mation  for  each  coordinate.  The  TPS  interpolant  f(x,y)  minimizes  the  bending  energy 


If  = 


s?)  +2(Mi)  + 


and  has  the  solution  of  the  form 


n 

f(x,  y)  =  cn+ axx  +  ayy +  '^2wiU(\\(xi,yi)  -  (x,j/)||) 

i=  1 


(96) 


(97) 


where  U(r)  is  the  kernel  function,  taking  the  form  of  U(r)  =  r2logr2.  The  parameters  of 
the  TPS  models  w  and  a  are  the  solution  of  the  following  linear  equation 


K 

P  ' 

w 

z 

.  pT 

0 

a 

0 

(98) 


where  KtJ  =  U(\\(xi,yi)  —  (xj,yj) ||),  the  ith  row  of  P  is  (1  ,  ay,r/j),  w  and  z  are  column 
vectors  formed  from  Wi  and  zt  respectively,  and  a  is  the  column  vector  with  elements 
a\  ,ax,  and  ay. 

If  errors  appear  in  the  matching  results,  we  use  regularization  to  trade  off  between 
exact  interpolation  and  minimizing  the  bending  energy  as  follows. 


Hf  =  Y}zi~f^yi)f  +  XIf  (99) 

i= 1 


where  A  is  the  regularization  parameter,  controlling  the  amount  of  smoothing.  The 
regularized  TPS  can  be  solved  by  replacing  K  in  (98)  with  K  +  XI,  where  /  is  the  n  x  n 
identity  matrix  [149,150].  We  set  A  =  1  in  the  following  experiments. 


5.5  Summary  of  Our  Approach 

Following  is  a  brief  summary  of  our  approach. 

Input :  Two  point  sets,  Tx ,  T2, . . . ,  TM  from  the  template  shape  T,  and  Di,D2,  ■  ■  ■ ,  DN 
from  the  deformed  shape  D. 

Output:  The  correspondence  between  two  point  sets. 

1.  Set  the  transformed  template  shape  T*  as  T. 

2.  Set  iteration  number  to  one. 
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3.  Calculate  the  shape  context  for  each  point  in  T*  and  D,  and  use  (87)  to  calculate 
the  distance  between  each  point  pair  T ™  and  Dn. 

4.  Use  (88)  to  initialize  the  matching  probability  matrix  P  and  convert  it  to  a  gener¬ 
alized  doubly  stochastic  matrix  by  alternated  row  and  column  normalizations. 

5.  Use  (90)  to  update  the  matching  probability  matrix  R  (R  =  300)  times.  After  each 
update,  convert  matrix  P  to  a  generalized  doubly  stochastic  matrix. 

6.  If  the  iteration  number  is  one,  use  LMS  to  estimate  the  affine  transformation  be¬ 
tween  T  and  D. 

7.  Otherwise,  use  (98)  to  estimate  parameters  of  the  TPS  deformation  model  between 
T  and  D. 

8.  Transform  template  point  set  T  to  T*  using  the  estimated  deformation  parameters. 

9.  Increase  the  iteration  number  by  one.  If  the  iteration  number  is  less  than  Imax 
{I max  =  10),  go  to  step  3. 

Suppose  both  shapes  have  N  points,  the  computation  cost  of  shape  context  distances 
is  in  the  order  of  0(N2).  Relaxation  labeling  updates  will  take  0(N2)  time.  The  compu¬ 
tational  complexity  of  the  algorithm  may  be  largely  dependent  on  the  implementation  of 
the  spline  deformation,  which  can  be  0(N3)  in  the  worst  case.  With  our  un-optimized 
C++  implementation,  matching  two  shapes  (each  with  100  points)  takes  about  1.6  sec¬ 
onds  on  a  PC  with  a  2.8  GHZ  CPU. 

5.6  Experiments 

In  this  section,  we  show  our  approach  preserves  sequential  ordering  of  points  (a  degener¬ 
ated  neighborhood  structure)  on  open  curves  and  closed  contours  during  matching.  We 
also  test  our  approach  in  matching  real  handwriting  samples.  We  then  quantitatively 
compare  it  with  two  state-of-the-art  algorithms  for  robustness  under  deformation,  noise 
in  point  locations,  outliers,  occlusion,  and  rotation. 

5.6.1  Some  Examples 

We  have  tested  our  algorithm  on  the  samples  used  in  [25]  and  compared  our  results  with 
two  other  algorithms:  shape  context  [24]  and  TPS-RPM  [25].  In  these  examples,  the 
template  and  deformed  shapes  have  the  same  number  of  points.  To  achieve  a  direct  and 
fair  comparison,  we  prefer  to  match  as  many  point  pairs  as  possible  without  rejection. 
The  shape  context  algorithm  can  achieve  this  by  setting  the  outlier  ratio  to  zero.  The 
TPS-RPM  algorithm  and  our  relaxation  labeling  based  approach  may  reject  some  points 
as  outliers  by  matching  them  to  a  dummy  point.  Unfortunately,  there  are  no  parameters 
available  in  either  algorithm  to  adjust  the  ratio  of  rejected  points  explicitly.  In  this 
experiment,  after  point  matching  and  shape  transformation  are  finished,  we  use  the 
approach  discussed  in  Section  5.3.2  to  minimize  the  summation  of  Euclidean  distances 
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Our  Approach 


Shape  Context 


TPS-RPM 


Figure  29:  Point  matching  results  on  open  curves  (the  left  column)  and  closed  contours 
(the  middle  and  right  columns).  Top  row:  our  approach.  Middle  row:  the  shape  context 
algorithm.  Bottom  row:  the  TPS-RPM  algorithm. 

between  the  transformed  template  point  set  and  the  deformed  point  set  (see,  Equation 
(93)). 

Fig.  29  shows  the  point  matching  results  of  three  algorithms  on  a  pair  of  open  curves 
and  two  pairs  of  closed  contours.  As  shown  in  the  left  column,  all  three  algorithms  achieve 
good  results  for  the  pair  of  open  curves  even  though  the  deformation  is  large.  Neighboring 
points  may  swap  their  matches  in  TPS-RPM.  For  the  first  pair  of  closed  contours,  all 
algorithms  achieve  reasonable  results,  but  the  shape  context  algorithm  introduces  a  few 
mismatches  as  shown  in  the  middle  column  of  Fig.  29.  Since  the  rotation  between 
two  shapes  is  large  for  the  second  pair  of  closed  contours,  the  rotation  invariance  shape 
context  is  used  for  initialization  in  our  approach  and  the  shape  context  algorithm.  Both 
our  approach  and  TPS-RPM  achieve  good  results  and  preserve  the  sequential  ordering 
of  points.  The  result  of  the  shape  context  algorithm  is  not  as  good:  neighboring  points 
in  one  shape  may  be  matched  to  points  far  apart  in  the  other  shape. 

We  also  test  our  algorithm  for  handwriting  matching.  Figs.  30a  and  b  show  two 
samples  of  handwritten  initials  from  the  same  person.  We  notice  the  structural  change 
for  handwriting  is  large:  the  characters  overlap  each  other  in  the  first  sample,  but  they  are 
well  separated  in  the  second  sample.  We  uniformly  sample  200  points  from  the  skeletons 
of  the  handwriting,  as  shown  in  Figs.  30c  and  d.  Fig.  30e  shows  the  point  matching 
results  using  our  approach.  Points  labeled  with  green  color  are  outliers  rejected  by  our 
algorithm.  On  the  D’s,  most  points  are  assigned  with  correct  correspondence.  The 
touching  parts  of  the  S  are  assigned  with  low  matching  probabilities,  therefore  rejected 
as  outliers.  More  examples  of  handwriting  matching  are  shown  in  Fig.  31. 

Our  approach  is  general  enough  to  incorporate  other  neighborhood  definitions.  In 
this  experiment,  we  use  a  neighborhood  definition  which  is  robust  when  different  parts  of 
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Figure  30:  Handwriting  matching,  (a)  and  (b)  two  handwritten  initials  from  the  same 
person,  (c)  and  (d)  the  point  sets  (each  with  200  points)  sampled  from  the  skeletons  of 
(a)  and  (b),  respectively,  (e)  Point  matching  results  using  our  approach. 


Figure  31:  More  examples  of  handwriting  matching  using  our  approach. 


Figure  32:  A  neighborhood  definition  which  is  robust  under  large  nonuniform  scale 
changes  for  different  parts  of  a  shape,  (a)  Point  sets  of  the  template  (o)  and  deformed 
(+)  shapes,  (b)  Template  graph  with  210  edges,  (c)  Deformed  graph  with  196  edges. 
Among  them,  178  (91%)  edges  also  present  in  the  template  graph,  (d)  Point  matching 
result. 

a  shape  have  significantly  different  scales.  For  two  points  in  a  shape,  they  are  neighbors 
if  and  only  if  they  are  both  in  each  other’s  top  Eave  nearest  points.  This  neighborhood 
relationship  is  still  symmetric  and  scale  invariant.  A  fish  shape  is  used  to  synthesize  test 
samples.  First,  we  applied  a  moderate  amount  of  nonrigid  deformation  to  the  template 
shape,  and  then  enlarged  the  fish  tail.  Fig.  32b  and  c  show  the  graphs  of  the  template 
and  deformed  shapes  when  Eave  =  5  and  the  fish  tail  is  enlarged  to  four  times  of  the 
original  size.  This  neighborhood  definition  is  robust:  91%  of  the  edges  in  the  deformed 
graph  have  a  correspondence  in  the  template  graph,  and  a  good  point  match  is  achieved 
as  shown  in  Fig.  32d. 

5.6.2  Quantitative  Evaluation 

Synthetic  data  is  easy  to  obtain  and  can  be  designed  to  test  specific  aspects  of  an  algo¬ 
rithm.  We  test  our  algorithm  on  the  same  synthesized  data  as  in  [25]  and  [24],  Three 
sets  of  data  are  designed  to  measure  the  robustness  of  an  algorithm  under  deformation, 
noise  in  point  locations,  and  outliers.  In  each  test,  the  template  point  set  is  subjected 
to  one  of  the  above  distortions  to  create  a  target  point  set  (for  the  latter  two  test  sets, 
a  moderate  amount  of  deformation  is  present).  Two  shapes  (a  fish  and  a  Chinese  char¬ 
acter)  are  used,  and  100  samples  are  generated  for  each  degradation  level.  We  then 
run  our  algorithm  to  find  the  correspondence  between  these  two  sets  of  points  and  use 
the  estimated  correspondence  to  warp  the  template  shape.  The  accuracy  of  the  match  is 
quantified  as  the  average  Euclidean  distance  between  a  point  in  the  warped  template  and 
the  corresponding  point  in  the  target.  Alternative  evaluation  metrics  are  possible  (e.g., 
the  number  of  correctly  matched  point  pairs),  but  in  order  to  compare  our  results  directly 
with  two  other  algorithms,  we  use  the  same  evaluation  metric  as  in  [25]  and  [24],  Fig. 
33  shows  several  examples  from  the  synthesized  data  sets,  and  Fig.  34  demonstrates  the 
quantitative  evaluation  results.  Since  the  new  neighborhood  definition  presented  above 
is  not  robust  to  outliers,  the  original  version  is  used  throughout  this  experiment.  Our 
algorithm  performs  best  on  the  deformation  and  noise  sets.  For  the  outlier  test  set,  how¬ 
ever,  no  clear  winner  appears.  The  TPS-RPM  algorithm  outperforms  our  algorithm  on 
the  Chinese  character  shape  under  large  outlier  ratios.  Since  points  are  spread  out  on  the 
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Figure  33:  Chui-Rangarajan  synthesized  data  sets.  The  template  point  sets  are  shown 
in  the  first  column.  Column  2-4  show  examples  of  target  point  sets  for  the  deformation, 
noise,  and  outlier  tests  respectively. 


Chinese  character  shape,  when  a  large  number  of  outliers  are  present,  the  neighborhood 
of  a  point  changes  significantly  (as  shown  in  the  last  column  of  Fig.  33),  which  violates 
our  assumption.  Points  on  the  fish  shape  are  clustered,  and  the  neighborhood  of  a  point 
is  preserved  well  even  with  a  large  outlier  ratio.  Therefore,  better  results  are  achieved 
by  our  algorithm  on  this  shape. 

Often  present  in  real  applications,  occlusion  is  a  challenge  for  many  algorithms.  In 
the  following  experiments,  we  test  the  three  algorithms  under  occlusion  using  synthesized 
data.  A  moderate  amount  of  nonlinear  deformation  is  applied  to  a  shape.  We  then 
randomly  select  a  point  and  remove  it  with  some  of  its  closest  points.  Six  occlusion 
levels  are  used:  0%,  10%,  20%,  30%,  40%,  and  50%,  and  100  samples  are  generated 
for  each  level.  The  top  row  of  Fig.  35  shows  two  synthesized  samples.  Quantitative 
evaluation  results  are  shown  in  the  bottom  row  of  Fig.  35.  The  TPS-RPM  algorithm 
treats  all  extra  points  as  outliers,  which  are  assumed  to  be  independently  distributed. 
Since  it  does  not  model  the  distribution  of  occlusions  well,  the  performance  of  TPS-RPM 
deteriorates  quickly.  In  our  approach,  the  change  of  neighborhood  structure  is  restricted 
to  points  close  to  the  occlusion.  As  shown  in  Fig.  35,  our  approach  achieves  the  best 
results  for  up  to  40%  occlusion.  When  the  occlusion  ratio  is  large,  a  shape  is  likely  to  be 
broken  into  several  parts  and  the  neighborhood  structure  of  almost  all  remaining  points 
may  be  changed.  Therefore,  when  the  occlusion  ratio  is  50%,  the  difference  between  our 
approach  and  the  shape  context  algorithm  is  small. 

In  some  applications,  rotation  invariance  is  a  critical  property  of  a  shape  matching 
algorithm.  We  test  our  algorithm  under  rotations  using  synthesized  data  of  the  same  fish 
and  Chinese  character  shapes.  A  moderate  amount  of  nonlinear  deformation  is  applied 
to  a  shape,  and  the  ground-truthed  correspondences  are  used  to  correct  the  rotation 
introduced  in  the  deformation.  We  then  rotate  the  deformed  shape.  The  probability  of 
selecting  a  clockwise  or  counterclockwise  rotation  is  equal.  Six  rotations  are  used:  0,  30, 
60,  90,  120,  and  180  degrees.  One  hundred  samples  are  generated  for  each  rotation.  The 
top  row  of  Fig.  36  shows  two  synthesized  samples.  At  the  first  iteration,  the  rotation 
invariant  shape  context  distance  is  used  to  initialize  the  matching  probabilities  in  our 
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Figure  34:  Comparison  of  our  results  (o)  with  the  TPS-RPM  (*)  and  shape  context  (□) 
algorithms  on  the  Chui-Rangarajan  synthesized  data  sets.  The  error  bars  indicate  the 
standard  deviation  of  the  error  over  100  random  trials.  Top  row:  the  fish  shape.  Bottom 
row:  the  Chinese  character  shape. 


approach.  The  rotation  between  two  shapes  is  corrected  by  the  affine  transformation  in 
the  first  iteration.  After  that,  the  normal  shape  context  distance  is  used.  Quantitative 
evaluation  results  appear  in  the  bottom  row  of  Fig.  36.  We  can  see  that  our  method 
is  truly  rotation  invariant,  and  it  consistently  outperforms  the  shape  context  algorithm. 
TPS-RPM,  however,  can  only  tolerate  a  rotations  up  to  60  degrees.  The  TPS-RPM  algo¬ 
rithm  often  fails  to  converge  to  a  useful  solution  if  rotation  with  any  degree  is  allowed  [25] , 
so  a  parameter  A2  is  used  to  penalize  a  large  rotation  in  the  TPS-RPM  algorithm.  If  A2 
is  set  to  zero,  its  performance  deteriorates  significantly,  much  worse  than  our  approach 
at  any  level  of  rotation.  Therefore,  the  default  setting  of  A2  (A2  =  0.01)  is  used  in  this 
comparison  experiment  for  the  TPS-RPM  algorithm. 

The  variance  of  all  algorithms  is  large.  Therefore,  a  statistical  analysis  must  be  ap¬ 
plied  to  ascertain  whether  the  difference  between  these  algorithms  is  significant.  Mean 
and  variance  can  fully  characterize  only  a  Gaussian  distribution.  Fig.  37a  and  b  show 
the  error  histograms  of  the  shape  context  algorithm  and  our  approach.  The  histograms 
are  generated  on  100  trials  of  the  fish  shape  under  the  deformation  level  of  0.05.  The 
distributions  differ  significantly  from  a  Gaussian  distribution.  Some  challenging  samples 
deteriorate  the  performance  and  increase  the  variance,  and  the  performance  of  two  algo¬ 
rithms  on  the  same  sample  is  not  independent.  Fig.  37c  shows  the  histogram  of  paired 
differences  between  two  algorithms  (the  error  of  the  shape  context  algorithm  minus  that 
of  our  approach).  The  two  algorithms  have  the  same  performance  for  about  one  third  of 
the  test  samples,  and  our  approach  outperforms  the  shape  context  algorithm  on  most  of 
the  remaining  samples. 

Since  the  distribution  of  errors  is  not  Gaussian,  we  use  the  Wilcoxon  paired  signed 
rank  test,  which  is  powerful  and  distribution  free  [151].  In  the  Wilcoxon  test,  paired 
differences  are  formed,  and  the  absolute  values  are  ranked.  Where  ties  occur,  the  average 
of  the  corresponding  ranks  is  used.  If  the  difference  between  two  measures  is  zero,  this 
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Figure  35:  Comparison  of  our  results  (o)  with  the  TPS-RPM  (*)  and  shape  context  (□) 
algorithms  under  occlusion.  Left  column:  the  fish  shape.  Right  column:  the  Chinese 
character  shape.  Top  row:  synthesized  samples.  Bottom  row:  mean  and  variance  of 
errors. 


Figure  36:  Comparison  of  our  results  (o)  with  the  TPS-RPM  (*)  and  shape  context  (□) 
algorithms  under  rotation.  Left  column:  the  fish  shape.  Right  column:  the  Chinese 
character  shape.  Top  row:  synthesized  samples.  Bottom  row:  mean  and  variance  of 
errors. 
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Figure  37:  Histogram  of  errors,  (a)  The  shape  context  algorithm,  (b)  Our  approach,  (c) 
Paired  differences  between  two  methods  (the  error  of  the  shape  context  algorithm  minus 
that  of  our  approach). 


Table  10:  Wilcoxon  paired  signed  rank  test.  +,  — ,  and  =  mean  the  former  algorithm  is 
better,  worse,  or  no  difference  than  the  latter,  respectively. 


Fish 

Chinese  character  | 

Ours  vs.  shape  context 

Ours  vs.  TPS-RPM 

Ours  vs.  shape  context 

Ours  vs.  TPS-RPM 

Deformation 

=  =  +  +  + 

=  =  +  +  + 

+  +  +  +  + 

+  +  +  +  + 

Noise 

=  +  +  +  +  + 

+  =  +  +  +  + 

+  +  +  +  =  + 

+  +  +  +  +  + 

Outlier 

+  +  +  +  + 

+  +  +  =  = 

+  +  +  +  + 

+  =  =  -  " 

Occlusion 

=  +  +  +  +  = 

+  +  +  +  +  = 

+  +  +  +  +  + 

+  +  +  +  +  + 

Rotation 

+  +  +  +  +  + 

=  +  +  =  =  = 

--  +  +  +  + 

=  =  +  +  +  + 

sample  is  excluded  from  the  analysis.  The  sum  of  the  ranks  with  a  positive  sign  and  the 
sum  of  the  ranks  with  a  negative  sign  are  calculated.  The  test  statistic  is  the  smaller  of 
these  two  sums.  Table  10  shows  the  statistical  analysis  (with  two-sided  significance  level 
of  0.01)  of  the  performance  of  onr  approach  compared  with  two  other  algorithms.  Here, 
+  (— )  means  the  improvement  (deterioration)  of  our  approach  is  statistically  significant 
compared  with  the  other  algorithm,  and  =  means  two  algorithms  do  not  differ  significant. 
The  statistical  test  verifies  that  the  improvement  of  our  approach  on  most  data  sets  is 
significant. 

5.7  Conclusions  and  Future  Work 

In  this  chapter,  we  have  introduced  the  notion  of  a  neighborhood  structure  for  the  general 
point  matching  problem.  We  formulated  point  matching  as  an  optimization  problem  to 
preserve  local  neighborhood  structures  during  matching.  Extensive  experiments  were 
presented  to  demonstrate  the  robustness  of  our  approach.  Compared  with  the  other 
two  state-of-the-art  algorithms,  our  approach  performs  as  well  or  better  under  nonrigid 
deformation,  noise  in  point  locations,  outliers,  occlusion,  and  rotation. 

Large  outlier  or  occlusion  ratios  (especially  if  the  occlusion  breaks  a  shape  into  several 
isolated  parts)  can  significantly  change  the  local  neighborhood  structure.  A  combination 
of  different  sources  of  degradation,  such  as  large  rotation,  noise,  and  occlusion,  also 
presents  a  challenge,  which  should  be  addressed  in  our  future  research.  In  this  work,  the 
relaxation  labeling  method  is  used  to  solve  the  constrained  optimization  problem.  Only 
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converging  to  a  local  optimum,  it  is  by  no  means  the  best  approach.  We  are  testing  other 
optimization  methods  such  as  simulated  annealing,  genetic  algorithms,  and  graduated 
nonconvexity  methods.  Our  graph  matching  formulation  is  applicable  for  both  2-D  and 
3-D  shapes.  Using  the  shape  context  distance  for  initialization,  we  only  demonstrate  it 
on  2-D  shapes,  since  the  original  shape  context  is  defined  only  for  2-D  point  sets.  We 
will  test  the  effectiveness  of  our  approach  for  3-D  shape  matching  by  extending  the  shape 
context  to  3-D  point  sets. 

A  reference  C++  implementation  of  our  approach  is  available  under  the  terms  of  the 
GNU  General  Public  License  (GPL)  at  http :  //www.  enee  .urad.  edu/~zhengyf  /PointMatching. 
htm. 
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Appendix 

Sinkhorn  showed  that  iterated  alternative  row  and  column  normalization  will  convert  an 
NxN  matrix  with  positive  elements  to  a  doubly  stochastic  matrix  [143] .  In  our  relaxation 
labeling  approach,  we  perform  iterated  alternative  row  and  column  normalization  (except 
the  last  row  and  column)  to  a  non-square  K  x  N  ( K  ^  N )  matrix  A.  This  appendix 
demonstrates  this  approach  is  mathematically  sound:  the  process  will  converge  to  a 
unique  matrix  T 4,  such  that  each  row  and  column  of  T A  sums  one  (except  the  last  row 
and  column).  The  proof  in  this  appendix  adheres  to  Sinkhorn’s  approach.  In  [143],  several 
important  steps  are  skipped  and  a  few  typographical  errors  exist.  In  this  appendix,  more 
cases  are  discussed  to  generalize  Sinkhorn’s  conclusion.  First,  we  give  a  formal  definition 
of  our  generalized  doubly  stochastic  matrix.. 

DEFINITION  1.  A  K  x  N  matrix  A  is  called  a  generalized  doubly  stochastic  matrix 
if 

K 

y:  <M3  =  1  for  j  =  1,2, . . . ,  N  —  1 

i= 1 

N 

a,ij  =  1  for  i  —  1,  2, . . . ,  K  —  1 

3= 1 

The  operation  of  row  normalization  can  be  represented  as  a  left  multiplication  of  A 
with  a  diagonal  matrix,  and  the  operation  of  column  normalization  can  be  represented 
as  a  right  multiplication  of  A  with  another  diagonal  matrix.  Multiple  row  (column) 
normalization  matrices  can  be  combined  as  D\  ( D-2 ).  Therefore,  the  overall  iterated  row 


(100) 

(101) 
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and  column  normalization  can  be  represented  as  T \  =  DiAD2.  The  following  theorem 
establishes  the  uniqueness  of  such  a  representation. 

Theorem  1  To  a  given  strictly  positive  K  x  N  matrix  A  there  corresponds  exactly  one 
generalized  doubly  stochastic  matrix  T 4  which  can  be  expressed  in  the  form  T 4  =  DiAD2 
where  D\  and  D2  are  diagonal  matrices  with  positive  diagonals.  D\  =  diag{dn  ,  di2, . . . ,  ditK-i,  1} 
and  D2  =  diag{d2i,  d22, . . . ,  d2:N-i,  1}-  The  matrices  D\  and  D2  are  unique. 

Proof:  Suppose  there  exist  two  different  pairs  of  diagonal  matrices  Dl,  D2  and  Cj , 

C2  such  that  P  =  C\AC2  and  Q  =  D\AD2  are  both  generalized  doubly  stochastic. 

Then,  we  can  write  Q  as  Q  =  D\Cf1PCf1D2.  Let  E  =  DiCf1  and  F  =  CflD-2l  then 
Q  =  EPF.  Suppose  E  =  diagjei,  e2,  •  •  • ,  eK-i,  1}  and  F  =  diag{/i,  f2, . . . ,  /tv-  1, 1},  Q 
can  be  expanded  as 

eifiPn  ei/2Pi2  •  •  •  ci/tv-iT/tv-i  eiPiN 

C2/1-P21  e2f2P22  . . .  e2fN-iP2,N-i  &2P2N 

Q  \  \  \  \  \  (102) 

CK-lflPK-1,1  eK-lf2PlC-l,2  ■  ■  ■  CK-lf N-iPk-1,N-1  ^K~iPk~1,N 
I\Pk\  f 2  Pr  2  ■  ■  ■  In-iPk.N-I  Pk,N 

The  summation  of  the  ith  row  of  Q  equals  1,  for  1  <  i  <  K  —  1. 

ei(flPil  +  ,/'2  Pj/2  +  •  •  •  +  fN-\Pi,N-\  +  Pin)  =  1  (103) 

Since  Y^f=i  Pj  =  1  and  >  0,  1  Id  is  a  convex  combination  of  {fj,  1}.  Therefore, 

rninjl,  /,■}  <  —  <  max{l,  /,}  for  i  =  1, 2, . . . ,  K  —  1  (104) 

i  e-i  3 

Similarly,  we  can  get 

rninjl,  ei}  <  —  <  rnaxjl,  e*}  for  j  —  1,  2, , . . ,  N  —  1  (105) 

*  Jj  1 

There  are  three  cases:  1)  max,;  et  <  1;  2)  ruin,  e%  >  1;  and  3)  min,  e%  <  1  <  max,  et. 

Let’s  discuss  the  first  case  that  max^e*  <  1.  Using  the  second  inequality  in  Eq.  (105), 
we  get  fj  >  1.  Then  second  inequality  in  Eq.  (104)  becomes  1  <  rriax^  /) .  It  follows 
that 

1  <  min  e,;  max  /)  (106) 

i  3 

Similarly,  the  first  inequality  in  Eq.  (105)  becomes  fj  min,;  et  <  1.  Therefore, 

mine,;  max /j  <  1.  (107) 

*  3 

Combining  the  above  two  inequalities,  we  get 

min  e*  max  f  ..  =  1  (108) 

i  j 
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Let  consider  the  summation  of  the  row  of  Q  corresponding  to  the  minimum  e*.  Sup¬ 
pose  e\  =  min,;  ei 


1  —  Ci(/iPn  +  /2-P12  +  •  •  •  +  fN-\P\,N-i  +  Pin) 

<  e\  [max  +  P12  +  •  •  •  +  Pi,jv— 1)  +  Pin] 

3 

<  ei  max  j)  (Pu  +  P12  +  ...  +  Pin) 

3 

<  e\  max  fj 

3 

=  1  (109) 

The  equality  holds  if  and  only  if  fi  =  f2  =  •  •  •  =  ,1'n-i  =  1-  And  considering  the  column 
with  the  maximum  fj,  we  get  e\  =  e2  =  •  •  •  =  e^-i  =  1. 

For  the  second  case,  min,  >  1,  it  is  easy  to  verify  that 

max  e{  min  f)  —  1  (HO) 

*  3 

And  for  the  last  case,  min,  et  <  1  <  max*  e% ,  we  can  get  both  equalities  (108)  and  (110). 
Following  similar  arguments,  we  can  show  that  the  equalities  ./1  —  f 2  —  ■■■  —  .I'n-i  =  1 
and  ei  —  e2  —  ■■■  —  c-k-i  =  1  hold  for  all  cases.  It  follows  that  D\  =  C\,  D2  =  C2,  and 
P  =  Q.  That  means  such  factorization  is  unique,  and  the  resulted  generalized  doubly 
stochastic  matrix  is  unique  too. 

Theorem  2  The  iterative  process  of  alternately  normalizing  the  rows  and  columns  ( ex¬ 
cept  the  last  row  and  column)  of  a  strictly  positive  K  x  N  matrix  is  convergent  to  a 
strictly  positive  generalized  doubly  stochastic  matrix. 


Proof:  The  iteration  produces  a  sequence  of  positive  matrices  which  alternately  have  row 
(except  the  last  row)  and  column  (except  the  last  column)  sums  one.  We  will  show  that 
the  two  subsequences  which  are  composed  respectively  of  the  matrices  with  row  sums  one 
and  the  matrices  with  column  sums  one  each  converge  to  a  positive  generalized  doubly 
stochastic  limit  of  the  form  DiAD2.  The  uniqueness  part  of  Theorem  1  guarantees  two 
limits  are  the  same.  In  the  following,  we  only  show  the  convergence  of  the  subsequence 
of  the  matrices  with  column  sums  one.  The  convergence  of  the  other  subsequence  is  easy 
to  show  following  similar  arguments. 

Let  {An}  =  {(anij)}  be  the  sequence  with  column  sums  one  (except  the  last  column), 
and  An  have  row  sums  Ani,  \n2, . . . ,  After  row  normalization,  we  calculate  the 

column  sums  Snj  (for  1  <  j  <  N  —  1) 

K—l 

dnj  =  E  U-nij /AnJ  T  anKj  (ill) 

i=  1 


Since  amj  =  1,  bnj  is  a  convex  combination  of  {1/Anj,  1}.  It  follows 
1  „  1 


max{l,A n(M)}  min{l,  A„(m)} 


for  J  =  1,2,...,  IV  —  1 


(112) 


where  the  m  and  M  respectively  label  minimal  and  maximal  quantities  relative  to  a  given 
An.  Similarly,  since  of  matrix  An+1  is  a  convex  combination  of  {1  /Snj,  1},  it  follows 


that 


max{l,  5n(M)} 


A  A n+l,i  A 


min{l,  8n(m)} 


for  i  =  1,  2, . . . ,  K  —  1 


(113) 


There  are  three  cases:  1)  A n{m)  >  1;  2)  A n(M)  <  1;  and  3)  A n(m)  <  1  <  A n(M).  For 
the  first  case  A n(m)  >  1,  from  Eq.  (112)  we  get  1/A n(M)  <  5nj  <  1.  Using  Eq.  (113), 
we  get 


1  <  An+M  <  A n(M) 


(114) 


Therefore, 


case  1:  A n(m)  >1  1  <  An+i(m)  and  1  <  An+i(M)  <  A n(M)  (115) 


Similarly 


case  2:  An(M)  <1  An+i(M)  <  1  and  A n(m)  <  \n+i(m)  <  1(116) 

case  3:  An(m)  <  1  <  A n(M)  A n(m)  <  A n+1(m)  <  1  <  Xn+1(M)  <  A n(M\117) 


In  the  following,  we  want  to  show  that  for  case  1  and  3,  A n(M)  left  converges  to  1  (from 
a  value  larger  than  1);  and  for  case  2  and  3,  A n(m)  right  converges  to  1  (from  a  value 
smaller  than  1).  If  the  convergence  holds,  using  Eq.  (112),  it  follows  that  5nj  converges 
to  1  too.  Therefore,  the  sequence  of  matrices  An  converges  to  a  generalized  doubly 
stochastic  matrix. 

Let  an  be  the  minimal  element  of  An  (excluding  the  last  row  and  column),  we  want 
to  show  that  an  >  0  for  all  n.  Starting  from  A1  =  {« iiy } ,  we  can  combine  all  row 
normalizations  of  row  i  (i  <  K)  up  to  nth  iteration  as  xni  =  [Ai^i  •  •  •  Anj]_1.  For  the 
last  row  xnx  =  1.  All  column  normalization  of  column  j  (j  <  n )  up  to  nth  iteration  is 
combined  as  ynj  =  [5i j$2j  •  •  •  Snj]-1  •  For  the  last  column  ynN  =  1.  Since  summation  of 
column  j  of  An  equals  one,  £T=1  %niatfyn]  =  1,  for  j  =  1, 2, . . . ,  N  —  1,  we  get 

111  .  , 

ynj  =  yv - < - < -  (118) 

/  jj  Qjlij%ni  Qjlij%ni  ^1  %ni 

In  particular  yn]  <  l/[a\Xn{M)\.  Since 

N 

^  ^  X at (*  1  ij ijnj  An+i>;j  (119) 

3= 1 

As  we  can  see  from  (115),  (116)  and  (117),  for  all  three  cases,  An+ijj  is  bounded  away 
from  0.  Let  An+i;j  >  A,  it  follows  that 

xni  >  - >  ai\xn(M)/N.  (120) 

Z-^j  °1  ijUnj 

The  last  inequality  is  derived  from  the  fact  that  a  \ <  1.  Also  ynj  =  1  /  Y^iaujxm  A 
1  f[Nxn(M)]  and  we  see  that  an+ 1:ij  =  xnia\t/ynj  >  ai\/N2  =  a  >  0.  Therefore,  an  >  0 
for  all  n. 
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For  case  1  and  3,  we  want  to  show  that  A n(M)  right  converge  to  1.  It  is  clear  that 
A n(M)  — >  1  +  c  where  c  >  0.  For  convenience  set  A n(M)  —  1  +  cn. 


K- 1 


i=  1 


nij 


=  e^+^e^+  e 


H-  ^nKj 


1 


1  _  52i=l  anij  +  Cn  X/i: 

O-nKj 


-  XX  a«o  +  1  ,  „ 


XI  °mj  +  1  c 


\  .  ->  1  0>  n; 


1  +  c„ 


$21) 


Using  the  fact  that  Yhianij  —  1, 


3nj  — 


1  +  Cn 


1  +  Cn 


A„i>l  a"ij  >  1  +  c„an 


1  +  Cn 


(122) 


It  follows  that 


N—l 


A 


■n+l,i 


E 


ar 


N-l 

“ nij  ^  Q> niN  ^  J-  T  Cn  ^  ^  dnij  ^  ®niN 

,  A ni$nj  A 1  -f*  Cn(ln  A nj  Anj 

J=1  J  J=1 


(123) 


Since  0  <  an  <  1,  therefore  (1  +  cn)/(l  +  cnan)  >  1,  thus 


An+l,i  E 


1  +  Cn 
1  ~t~  Cn/i, 


'iV-1 

E 

,j=l 


dnij  ^  O' nNj 
^ ni  ^ ni 


(124) 


Because  X)X=i  anij/Xni  =  1  (the  row  summation  after  row  normalization),  therefore, 


>  ,  1  +  cn  1  +  cn 

/A-ri  -1- 1  ^  ~  \  ~ 

1  +  cnan  1  +  cna 

The  above  inequality  holds  for  all  i,  particularly, 

1  +  cn 


1  +  c  <  A„,+i(M)  < 


1  T  cna 


(125) 


(126) 


Since  cn  — >•  c,  the  above  condition  holds  if  and  only  if  c  =  0.  Therefore  A n(M)  — >  1. 

For  case  2  and  3,  we  need  to  show  that  A n(m)  left  converge  to  1.  Let  A n(m)  — >  1  —  d 
where  d  >  0,  and  A n(m)  =  1  —  dn,  then 


i-^ni 


V,=  E  xfi+  E 

''m  x  ^  /xm 


i-^ni>  1 


* nij 


“1“  & nMj  — 


1  dm 


XX  XX  an*j  “1“  Q"nMj 


i-^ni 


i-^ni^  1 


1  dnan 
1  dn 
(127) 

~  ,  (128) 
-L  -L  un(X 

Since  dn  — >  d,  the  above  condition  holds  if  and  only  if  d  —  0.  It  follows  A ,n(m)  — >  1.  This 
completes  the  proof. 


And 


1  —  d  >  An+i  (m)  >  - r— —  > 


6  Handwriting  Synthesis 

6.1  Introduction 

A  statistical  pattern  recognition  system  depends  heavily  on  the  size  and  quality  of  the 
training  set.  Although  preparing  samples  of  machine  printed  text  is  easy,  doing  so  is  ex¬ 
pensive  for  handwriting.  Synthesized  data  can  be  used  as  a  supplement.  In  the  previous 
work,  many  handwriting  synthesis  approaches  have  been  proposed  to  simulate  the  writ¬ 
ing  style  of  a  person  [152],  or  enlarge  the  training  set  for  a  recognition  system  [107,108]. 
They  can  be  roughly  categorized  as  perturbation-based,  model-based,  or  example-based. 
Perturbation-based  methods  need  only  one  handwriting  sample.  New  samples  are  gen¬ 
erated  by  assigning  random  parameters  to  a  deformation  model,  which  is  then  used  to 
deform  the  sample  [25,108].  However,  without  considering  the  deformation  characteristics 
of  handwriting,  unrealistic  samples  may  be  generated.  Instead,  model-based  approaches 
learn  the  deformation  of  handwriting  and  explicitly  describe  it  as  a  distribution  (the  dis¬ 
tribution  is  often  called  a  model)  [152,153].  After  learning,  handwriting  synthesis  is  the 
process  of  drawing  new  samples  from  the  distribution.  Although  theoretically  founded, 
model-based  approaches  have  some  drawbacks  in  real  applications:  handwriting  models 
are  often  complex,  and  many  samples  are  demanded  for  model  training.  Example-based 
approaches  use  two  handwriting  samples  and  generate  new  samples  with  shapes  similar 
to  both  inputs  [107].  Compared  with  model-based  approaches,  fewer  samples  are  needed 
because  this  approach  does  not  need  to  learn  the  distribution  of  deformation.  Both  model- 
based  and  example-based  approaches  need  to  perform  handwriting  matching,  which  is  a 
challenge  because  handwriting  is  a  nonrigid  shape. 

6.2  Our  Approach 

The  key  problem  of  handwriting  synthesis  involves  generating  samples  that  look  natural. 
Otherwise,  arbitrarily  synthesized  samples  cannot  improve  (if  not  deteriorate)  the  per¬ 
formance  of  the  system  trained  on  them.  Although  handwriting  samples  vary  greatly  in 
respect  to  size,  rotation,  and  stroke  width,  shape  is  generally  used  to  categorize  them  into 
different  classes.  Since  nonrigid  deformation  of  handwriting  is  large,  we  argue  that  a  syn¬ 
thesis  algorithm  should  learn  the  shape  deformation  characteristics  from  real  handwriting 
samples.  It  is  reasonable  to  assume  that  the  shape  space  of  handwriting  with  the  same 
content  (e.g.,  the  handwriting  samples  of  the  letter  ‘a’)  is  continuous.  For  characters  with 
several  different  writing  glyphs,  such  as  number  ‘7,’  we  may  need  to  do  clustering  analysis 
to  segment  the  shape  space  into  multiple  continuous  sub-spaces.  Given  two  handwriting 
samples  close  in  the  shape  space,  the  interpolation  between  them  is  likely  to  lie  inside  the 
shape  space  too  (this  is  guaranteed  if  the  shape  space  is  convex).  That  means,  given  two 
actual  and  similar  handwriting  samples,  it  is  reasonable  to  assume  a  person  may  write 
with  a  shape  between  them  (i.e.,  with  similar  but  less  degree  of  deformation). 

In  this  chapter,  we  propose  an  example-based  handwriting  synthesis  approach  using 
two  training  samples.  We  use  our  handwriting  matching  algorithm  to  establish  the  cor¬ 
respondence  between  two  handwriting  samples.  After  handwriting  matching,  we  warp 
one  sample  toward  the  other  using  the  TPS  deformation  model.  By  adjusting  the  reg¬ 
ularization  parameter  A  of  the  TPS  deformation  model  (Equation  (129)),  we  can  adjust 
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Figure  38:  Handwriting  synthesis.  Two  training  samples  and  the  point  matching  result 
are  shown  on  the  top  row.  Synthesized  samples:  second  row  for  our  approach,  third  row 
for  [25],  and  forth  row  for  [107]. 


the  amount  of  non-rigid  deformation. 

n 

H,  =  J2lz‘-f(^yt)]2  +  XIf  <129) 

i= 1 

Please  refer  to  Section  5.4  for  a  detailed  description  of  the  regularization.  The  regular¬ 
ization  parameter  A  is  used  in  a  different  way  compared  to  the  shape  matching  in  the 
previous  chapter.  Here,  we  use  it  to  adjust  the  amount  of  non-rigid  deformation.  It  has 
been  shown  that  if  A  =  oo,  the  deformation  is  the  affine  transformation.  With  a  smaller 
A,  the  interpolated  shape  is  closer  to  the  target  shape.  On  shape  matching,  however,  it 
is  used  to  reduce  the  effect  of  outliers  in  the  match  estimate. 

Among  all  previous  work,  the  algorithm  proposed  by  Mori  et  al.  [107]  is  the  most 
similar  to  our  approach.  They  use  the  well-known  iterated  closed  point  (ICP)  algo¬ 
rithm  [111]  to  get  the  displacement  vector  of  each  pixel.  A  new  sample  is  generated  by 
moving  each  pixel  along  its  displacement  vector.  Compared  with  our  approach,  it  has 
two  drawbacks:  (1)  the  ICP  algorithm  is  not  robust  under  nonrigid  deformation  [25],  and 
(2)  the  displacement  field  is  not  continuous,  so  the  synthesized  sample  may  change  the 
topology. 

6.3  Experiments 

In  this  section,  we  apply  our  handwriting  matching  algorithm  to  handwriting  synthesis 
and  compare  it  with  two  other  algorithms  [25,107].  The  top  row  of  Fig.  38  shows  two 
handwriting  samples  and  their  point  matching  result.  Under  this  match,  we  use  TPS  to 
deform  the  first  sample  to  synthesize  new  samples,  as  shown  in  the  second  row.  From 
left  to  right,  A  takes  the  value  of  oo,  10,  1,  and  0.1,  respectively.  With  the  decrease 
of  the  regularization  parameter  A,  the  synthesized  sample  is  closer  to  the  second  real 
sample.  The  third  row  in  Fig.  38  shows  synthesized  samples  using  a  perturbation-based 
approach  [25].  With  only  one  training  sample,  the  deformation  of  synthesized  samples  is 
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Figure  39:  Handwriting  synthesis  with  point  matching  errors  presented.  Two  training 
samples  and  the  point  matching  result  are  shown  on  the  top  row.  Synthesized  samples: 
second  row  for  our  approach,  third  row  for  [107]. 


random,  and  sometimes  unnatural  samples  may  be  generated.  The  third  row  shows  the 
synthesized  samples  using  the  approach  of  [107].  Suppose  point  T)  is  matched  to  point 
.D/(j),  its  displacement  is  di  =  Df p)  —  T%.  For  an  un-matched  point,  the  displacement 
takes  the  value  of  its  closest  matched  point.  A  new  sample  is  synthesized  by  moving  a 
point  along  its  displacement. 

Si  =  Ti  +  pdi  (130) 

If  p  —  0,  the  synthesized  sample  is  the  original  template;  and  if  p  —  1,  all  matched 
points  will  be  moved  to  the  target  under  any  matching  function.  The  last  row  of  Fig. 
38  shows  several  synthesized  samples  with  p  equals  0.2,  0.4,  0.6,  and  0.8,  respectively. 
As  shown  in  the  figure,  both  example-based  approaches  can  learn  shape  deformation 
characteristics  given  good  point  matching  results.  The  approach  of  [107],  however,  may 
change  the  topology  (the  synthesized  handwriting  is  broken  into  several  parts)  due  to 
the  dis-continuity  of  the  displacement  field.  This  drawback  becomes  obvious  when  point 
matching  errors  (which  are  un-avoidable  in  real  applications)  are  present.  The  second 
and  third  rows  of  Fig.  39  show  synthesized  samples  using  our  approach  and  [107].  The 
topology  of  samples  synthesized  by  our  approach  is  preserved,  even  under  substantial 
matching  errors.  Using  the  approach  of  [107],  unrealistic  samples  are  generated. 

More  examples  on  handwriting  synthesis  are  shown  in  Fig.  40.  The  ordering  of 
samples  is  the  same  as  Fig.  38.  Samples  with  different  slants  (within  the  range  of  the 
slant  between  two  training  samples)  are  generated  using  our  approach. 

6.4  Discussion  and  Future  Work 

In  this  chapter,  we  applied  our  handwriting  matching  algorithm  to  synthesize  new  train¬ 
ing  samples  for  handwriting  recognition.  Our  approach  automatically  learns  shape  defor¬ 
mation  characteristics  from  training  samples,  generating  more  visually  realistic  samples. 
Although  it  has  been  shown  in  several  independent  experiments  that  synthesized  hand¬ 
writing  samples  can  improve  a  recognition  system  trained  on  them  [107, 108],  we  will 
perform  more  experiments  to  verify  it  in  the  future.  The  limitation  of  our  approach  is 
that  we  assume  the  same  character  is  written  with  similar  shape  though  with  some  degree 
of  distortion.  Due  to  difference  in  geological  location,  education,  and  culture,  people  may 
write  the  same  character  with  significantly  different  shapes.  We  may  need  to  perform 
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Figure  40:  More  examples  on  handwriting  synthesis.  Two  training  samples  and  the 
point  matching  result  are  shown  on  the  top  row.  Synthesized  samples:  second  row  for 
our  approach,  third  row  for  [25],  and  forth  row  for  [107]. 


clustering  analysis  to  separate  the  training  samples  into  several  clusters.  Each  cluster  is 
more  homogeneous,  so  our  approach  can  be  applied. 
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7  Handwriting  Retrieval 
7.1  Introduction 

Handwriting  not  only  conveys  information  by  its  content  (what  one  writes),  but  also 
contains  unique  characteristics  of  the  writer  (how  one  writes).  People  have  produced 
huge  amounts  of  handwritten  documents  in  history  and  continue  to  do  so.  With  the 
coming  of  the  electronic  document  era,  the  traditional  library  faces  the  challenge  of  how  to 
make  an  enormous  amount  of  handwritten  historical  documents  accessible  electronically. 
Scanning  documents  into  a  computer  is  easy.  However,  without  a  searchable  index,  the 
scanned  images  have  little  use.  In  one  solution,  optical  character  recognition  (OCR) 
can  convert  handwriting  to  text,  then  the  traditional  text-based  retrieval  techniques  can 
perform  their  tasks.  However,  the  state-of-the-art  techniques  for  handwriting  recognition 
are  error  prone.  No  reliable  product  is  available  to  recognize  unconstrained  handwriting, 
except  for  handwritten  digits  [3].  Recognition  errors  will  significantly  deteriorate  the 
performance  of  the  traditional  text-based  techniques.  Direct  retrieval  based  on  image, 
without  recognition,  may  achieve  better  results  [154],  Another  application  of  handwriting 
retrieval  is  searching  a  specified  signature  in  a  set  of  documents.  Handwriting  recognition 
has  little  use  in  this  application  because  the  signature’s  character  sequence  has  little  or 
no  inherent  importance.  Instead,  its  unique  characteristics  are  of  more  concern.  In  these 
applications,  the  direct  retrieval  of  handwriting  in  document  images  is  desired. 

The  retrieval  of  character  sequences  is  also  called  keyword  spotting  in  the  litera¬ 
ture  [155].  Two  similar  keyword  spotting  approaches  based  on  the  hidden  Markov  models 
(HMM)  were  proposed  in  [155]  and  [156]  for  printed  text  in  degraded  document  images, 
where  the  results  of  OCR  were  not  reliable.  The  training  of  HMMs  needs  a  large  set  of 
labeled  samples,  but  only  one  or  several  handwriting  instances  may  be  available  for  query. 
Several  handwriting  retrieval  techniques,  such  as  the  dynamic  time  warping  based  ap¬ 
proach  [154]  and  the  corner  feature  based  approach  [157],  have  been  proposed.  However, 
these  techniques  are  not  robust  under  non-rigid  deformation  of  handwriting. 

Alternatively,  we  can  consider  handwriting  as  a  shape.  Shape  presentation,  analysis, 
matching,  and  recognition  is  an  active  area  in  computer  vision  and  widely  studied  [109, 
110].  Recently,  the  shape  context  method  proposed  by  Belongie  et  al.  [24]  has  achieved 
success  in  many  shape  recognition  and  retrieval  applications.  A  shape  is  represented  as  a 
set  of  points  in  this  approach.  A  shape  context  is  assigned  to  each  point,  which  describes 
the  relative  distribution  of  the  other  points  in  the  shape.  After  defining  the  similarity 
between  two  points  based  on  their  shape  contexts,  the  Hungarian  algorithm  [126]  searches 
for  the  optimal  match  between  the  two  point  sets.  Similarity  measures,  such  as  the  shape 
context  distance  and  the  thin  plate  spline  (TPS)  bending  energy,  are  calculated  with 
the  current  estimate  of  the  correspondence  and  used  for  shape  recognition  or  retrieval. 
Ling  and  Jacobs  [158]  improved  the  original  shape  context  method  by  replacing  the 
Euclidean  distance  with  the  inner-distance,  which  is  invariant  for  articulated  shapes,  in 
the  calculation  of  shape  context.  Though  the  discrimination  power  is  improved  by  using 
the  inner-distance,  the  method  is  sensitive  to  structure  change  of  a  shape.  As  shown 
in  Fig.  43,  a  handwriting  stroke  is  often  broken.  Two  strokes  may  touch  each  other  in 
one  sample,  but  well  separated  in  another  sample.  Structure  change  is  also  a  challenge 


93 


for  shock-graph  based  approaches  [159,160],  where  the  connection  of  two  points  by  a 
stroke  is  utilized.  On  the  contrary,  by  treating  a  shape  as  a  set  of  isolated  points,  the 
original  shape  context  method  is  more  robust  under  structure  change.  In  many  shape 
retrieval  approaches  [158,161,162],  a  shape  is  often  represented  as  a  sequence  of  ordered 
points  sampled  from  its  contour.  This  may  apply  for  the  retrieval  of  on-line  handwriting 
samples  collected  on  a  PDA  or  TablePC.  However,  for  an  off-line  handwriting  sample,  it 
is  difficult  to  recover  the  temporal  information  [163].  Therefore,  we  cannot  order  points 
to  form  a  1-D  sequence  reliably  due  to  structure  change  of  handwriting. 

In  this  chapter,  we  test  several  variations  of  the  original  shape  context  method  for 
handwriting  retrieval.  The  shape  context  method  can  be  decomposed  as  two  steps:  shape 
matching  and  shape  retrieval.  Each  step  has  several  alternative  technical  solutions,  which 
are  tested  in  our  experiments.  To  avoid  confusion,  we  call  these  two  steps  of  the  original 
method  the  shape  context  matching  algorithm  and  the  shape  context  retrieval  algorithm, 
respectively.  As  discussed  in  the  previous  chapter,  the  shape  matching  part  of  the  shape 
context  algorithm  is  not  robust.  Two  neighboring  points  in  one  shape  may  be  matched 
to  two  points  far  apart  in  the  other  shape.  We  proposed  a  new  shape  matching  algorithm 
in  the  previous  chapter  to  preserve  local  neighborhood  structures  during  matching.  Ex¬ 
periments  show  our  new  approach  achieves  much  better  shape  matching  results.  In  this 
chapter,  we  replace  the  original  shape  matching  method  with  the  proposed  one  and  test 
it  for  handwriting  retrieval.  Experiments  show  that  a  slight  improvement  is  achieved. 
We  propose  a  few  new  similarity  measurements  based  on  the  point  matching  results  to 
further  improve  the  retrieval  accuracy.  The  performance  with  a  single  query  is  limited 
due  to  noise  and  the  large  variations  of  handwriting.  We  propose  a  method  to  combine 
retrieval  results  with  multiple  queries. 

7.2  Our  Handwriting  Retrieval  Method 

In  the  original  shape  context  method,  a  set  of  uniformly  sampled  points  on  the  contour 
is  used  to  represent  a  shape,  as  contour  is  a  general  representation  and  a  shape  can  be 
recovered  from  its  contour.  However,  a  contour  is  affected  by  the  stroke  width,  which 
depends  on  the  writing  tools  and  scanning  parameter  settings.  Instead,  we  use  a  skeleton, 
which  is  widely  used  in  handwriting  recognition,  to  represent  a  handwriting  sample. 
There  are  two  advantages  of  using  a  skeleton:  1)  it  is  robust  under  the  variation  in  stroke 
width;  2)  less  points  are  demanded  to  represent  handwriting.  Suppose  each  shape  is 
represented  with  N  points,  the  computation  time  is  in  the  order  of  0(N 3)  for  shape 
matching  and  0(N2)  for  similarity  calculation.  With  a  small  number  of  points,  we  can 
significantly  increase  the  retrieval  speed.  Fig.  41a  shows  the  shape  representation  with 
300  points  sampled  from  the  contour,  and  Fig.  41b  shows  the  representation  with  200 
points  from  the  skeleton.  We  can  see  that  200  points  from  the  skeleton  is  as  good  as  300 
points  from  the  contour  to  represent  a  handwriting  sample. 

After  point  matching,  two  similarity  measures  were  proposed  in  the  original  shape 
context  method  [24]  for  shape  recognition  and  retrieval.  One  is  based  on  the  shape 
context  distance  ( Vsc ),  and  the  other  is  the  TPS  bending  energy  ( V f>e).  Suppose  there 
are  M  points  in  the  template  shape  T  and  N  points  in  the  deformed  shape  D.  The 
shape  context  distance  between  shapes  T  and  D  is  the  symmetric  sum  of  shape  context 
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(b) 


Figure  41:  Handwriting  representation:  contour  vs.  skeleton,  (a)  Representation  with 
300  points  sampled  from  the  contour,  (b)  Representation  with  200  points  sampled  from 
the  skeleton. 

matching  costs  over  best  matching  points, 

Vsc  =  jj  arg  fin  c(r  W  ’  ^  arg  fif  C(T  (*)  ’ d) ’  (m) 

teT  deD 

where  T(.)  denotes  the  estimated  TPS  shape  transformation;  C(., .)  is  the  shape  context 
matching  cost  between  two  points.  Consider  two  points,  t  in  shape  T,  and  d  in  the  other 
shape  D.  Their  shape  contexts  are  ht(k)  and  hfk),  for  k  =  1,2,  respectively. 

C(t,d)  is  defined  as 

,x  _  1  [ht(k)  —  hd(k)]2 

’  ~  2  ^  ht(k)  +  hd(k ) 

The  TPS  bending  energy  represents  the  amount  of  deformation  between  two  shapes 
and  can  be  used  as  a  shape  similarity  measure.  As  discussed  in  the  previous  chapter, 
two  TPS  models  are  used  for  2-D  coordinate  transformation.  T>be  is  defined  as  the  sum 
of  the  bending  energy  of  two  TPS  models. 

Besides  Dsc  and  Vbe  proposed  in  the  original  shape  context  method,  we  can  define 
more  similarity  measures  to  improve  the  retrieval  accuracy.  The  TPS  bending  energy 
only  measures  the  energy  of  deformation  beyond  an  affine  transformation.  In  other  words, 
the  bending  energy  is  zero  for  an  affine  transformation.  An  affine  transformation  can 
be  decomposed  as  translation,  rotation,  and  anisotropic  scales.  According  to  Kendell’s 
shape  theory  [164]  [145,  p.  1],  shape  is,  “all  the  geometrical  information  that  remains 
when  location,  scale  and  rotational  effects  are  filtered  out  from  an  object.”  So,  shape  is 
invariant  up  to  the  similarity  transformation  (translation,  rotation,  and  isotropic  scale). 
The  amount  of  anisotropic  scales  is  a  good  measure  of  the  similarity  between  two  shapes. 
6  Suppose  the  scales  of  the  x  and  y  coordinates  are  Sx  and  Sy,  respectively.  We  can 
estimate  the  scales  from  the  affine  transformation  matrix  based  on  the  singular  value 
decomposition.  A  2-D  affine  transformation  can  be  represented  as  a  2  x  2  matrix  A  and 
a  2  x  1  translation  vector  T 

)='4(y)+T'  (133) 

6  This  similarity  measure  was  implemented  in  the  Matlab  package  of  the  shape  context  method  dis¬ 
tributed  on-line.  But  its  effectiveness  in  shape  recognition  and  retrieval  has  not  been  tested  and  docu¬ 
mented  in  the  paper  [24] . 
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Sx  and  Sy  are  the  singular  values  of  matrix  A.  The  distance  T>af  is  dehned  as 


T>af  =  log 


/ ma x(Sx,  Sy)  \ 

V  nhn(5'x,  Sy)  )  ' 


(134) 


If  the  scales  are  isotropic,  Sx  =  Sy,  then  Vaf  =  0. 

We  propose  another  distance  measure,  T>re,  based  on  the  registration  residual  errors. 
Suppose  point  ti  in  shape  T  is  matched  to  point  df p)  in  shape  D.  Vre  is  dehned  as 


V 

LSre 


H^(^)  df(i) 


(135) 


where  |j.j|  is  the  Euclidean  distance.  In  calculating  Vre ,  we  remove  those  points  matching 
to  a  dummy  point  nil. 

Our  matching  algorithm  will  automatically  reject  some  points  as  outliers.  If  two 
shapes  are  similar,  the  number  of  rejected  points  will  be  small,  so  the  number  of  outliers 
offers  a  measure  of  the  similarity  between  two  shapes.  The  distance  measure  T>ou  is 
dehned  as  the  number  of  outliers  rejected  during  matching. 

In  total,  we  have  hve  similarity  measures,  which  have  different  scales.  The  overall 
similarity  measure  T>au  is  dehned  as  the  weighted  Euclidean  distance  of  all  distance  mea¬ 
sures.  Given  a  query,  we  calculate  its  distance  measures  to  all  samples  in  the  database. 
We,  then,  can  calculate  the  variance  of  each  measure.  The  overall  similarity  measure  T>au 
is  define  as 


T^aii  — 


V , 


+ 


V. 


be 
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af 
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Vn 


V  ar(Vsc)  Var(Vbe)  Var(Vaf)  V  ar(Vre)  Var(Vou) 


(136) 


Since  the  weights  are  calculated  on-line,  different  queries  may  have  different  combination 
weights.  In  general,  we  do  not  have  many  samples  to  train  the  weights  for  a  retrieval 
task,  and  the  above  scheme  is  by  no  means  optimal.  Commonly,  user’s  feedback  is  often 
used  to  adjust  the  weights  in  the  retrieval  community.  This  process  is  called  relevance 
feedback,  which  may  signihcantly  increase  the  retrieval  accuracy  after  a  few  iterations. 

The  retrieval  performance  with  a  single  query  instance  is  limited  due  to  the  noise  and 
the  large  variations  of  handwriting  (as  shown  in  Figs.  42b  and  43).  The  performance  of 
a  retrieval  depends  heavily  on  the  sample  used  for  query.  It  is  often  possible  to  have  a 
couple  of  handwritten  samples  from  the  same  person  available  for  retrieval.  For  example, 
we  can  ask  the  person  to  write  a  sample  several  times,  or  we  can  select  the  correctly 
returned  samples  from  the  first-round  retrieval  and  use  them  with  the  original  query 
instance  in  the  second-round  retrieval.  Suppose,  multiple  instances  from  the  same  class 
qi,  q2,  ■  ■  • ,  qk  are  used  as  queries.  The  corresponding  distances  of  a  handwritten  sample 
in  the  collection  to  these  queries  are  di,  o?2, . . . ,  d^.  We  combine  them  into  a  final  distance 


d  =  f(di,d2,  •  •  •  ,dk).  (137) 

Using  one  instance  for  query,  we  can  achieve  a  sorted  list  of  all  samples  in  the  collection. 
Using  multiple  instances,  multiple  sorted  lists  of  the  same  collection  can  be  achieved. 
The  problem  of  combining  these  sorted  lists  to  get  the  final  one  is  similar  to  the  multiple 
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classifier  combination  problem,  which  is  an  active  research  topic  in  pattern  recognition. 
Many  techniques  have  been  proposed.  Currently  we  use  a  simple  technique:  using  the 
smallest  distance  as  the  combined  distance  measure 

d  =  min{di,d,2,  ■  ■  ■  ,dk}-  (138) 


7.3  Experiments 

In  this  section,  we  first  present  our  retrieval  evaluation  metrics.  The  effects  of  different 
shape  representations,  shape  matching  algorithms,  and  similarity  measures  are  compared 
in  the  task  of  handwritten  initial  retrieval.  We  also  present  an  experiment  for  logo 
retrieval. 


7.3.1  Evaluation  Metrics 


Evaluation  metrics  for  retrieval  are  well  studied  in  the  traditional  information  retrieval 
literature.  Generally,  two  set  of  metrics,  precision  and  recall,  are  widely  used. 


Precision 


#  Returned  relevant  samples 
#  Returned  samples 


(139) 


Recall 


#  Returned  relevant  samples 
#  Relevant  samples  in  the  database 


(140) 


Region  of  characteristic  (ROC)  curves,  indicating  the  relationship  between  precision  and 
recall,  can  release  the  performance  of  a  retrieval  algorithm  completely.  Since  comparing 
ROC  curves  of  two  algorithms  is  complicated,  sometimes,  a  metric  with  a  scalar  value  is 
preferred.  R-Precision  is  such  a  metric.  Suppose  a  total  of  M  samples  in  the  database, 
and  among  them  N  samples  are  relevant  for  a  query.  The  retrieval  algorithm  will  return  a 
ranked  list  of  all  M  samples.  R-Precision  is  the  precision  of  the  first  N  retrieved  samples 
in  the  list. 


^  „  #  Relevant  samples  in  top  N  returned  samples  .  ,  . 

R-Precision  =  — - - - - - -  (141) 


R-Precision  is  used  for  evaluation  in  the  following  experiments. 


7.3.2  Handwritten  Initial  Retrieval 

To  test  the  effectiveness  of  the  proposed  method  for  handwriting  retrieval,  we  collected 
a  small  dataset  of  handwritten  initials.  We  had  eight  persons  with  each  of  them  writing 
40  sets  of  initials.  Fig.  42a  shows  one  sample  from  each  person,  and  42b  shows  several 
samples  from  one  person.  As  we  can  see,  variations  (besides  rotation  and  scales)  are  large 
for  handwriting,  and  noise,  such  as  underlines,  would  make  things  worse. 

In  the  first  experiment,  we  evaluate  the  effect  of  different  shape  representations  on 
retrieval  accuracy.  For  a  general  shape,  a  contour  is  often  use  to  represent  a  shape,  then 
a  number  of  points  are  uniformly  sampled  from  the  contour.  Handwriting  is  composed 
of  thin  strokes,  so  skeleton  representation  may  offer  a  better  choice  than  contour.  In  this 
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Figure  42:  Some  samples  from  the  handwritten  initial  data  set.  (a)  One  sample  from 
each  person,  (b)  Several  samples  from  one  person. 


experiment,  we  use  the  original  shape  context  algorithm  for  experiments:  the  original 
shape  context  matching  method  is  used  to  search  for  the  correspondence  between  two 
point  sets;  Vsc  and  Dbe  are  used  to  measure  the  similarity  between  two  shapes.  The 
number  of  sampled  points  is  fixed  as  200.  For  an  exhaustively  evaluation  on  the  database, 
each  sample  should  be  selected  once  for  query.  Supposing  M  samples  in  the  database,  the 
number  of  shape  comparisons  is  M  x  (M  —  1).  In  our  case,  it  is  320  x  319  =  102,  080.  Each 
shape  comparison  may  take  about  one  second.  Due  to  the  speed,  we  randomly  select  10% 
of  the  samples  as  queries.  It  is  sufficient  to  reveal  the  difference  between  different  shape 
representations.  To  remove  any  bias,  the  selected  sample  is  removed  from  the  database 
for  this  query.  The  overall  R-Precision  is  reported  as  the  average  of  R-Precisions  of 
all  queries.  The  overall  R-Precision  of  the  contour  based  representation  is  69.4%.  For 
comparison,  two  skeleton  extraction  algorithms  are  implemented  [165,166].  The  overall 
R-Precision  increases  slightly  to  70.5%  using  Dyer  and  Rosefled’s  method  [165],  and 
to  71.4%  using  Zhang  and  Suen’s  method  [166].  Since  the  latter  skeleton  extraction 
algorithm  achieves  the  best  performance,  we  use  it  for  the  following  experiments. 

In  this  experiment,  we  test  the  effectiveness  of  our  shape  matching  method.  We  re¬ 
place  the  matching  method  with  our  approach  presented  in  the  last  chapter,  and  use  the 
same  similarity  measures  T>sc  and  T>be.  The  same  10%  randomly  selected  samples  in  the 
previous  experiment  are  used  as  queries.  The  overall  R-Precision  using  our  shape  match¬ 
ing  approach  is  72.3%.  Out  of  our  expectation,  the  improvement  is  slight  compared  to  the 
original  overall  R-Precision  of  71.4%.  Detailed  analysis  shows  that  our  shape  matching 
approach  decreases  the  average  distances  between  shapes  from  the  same  category.  This 
verifies  that  better  shape  matching  results  are  achieved  by  our  approach.  However,  our 
matching  approach  also  decreases  the  average  distances  between  shapes  from  different 
categories.  The  conclusion  of  this  experiment  is  that  a  better  shape  matching  algorithm 
does  not  necessarily  increase  the  discrimination  capability  of  a  measure,  which  is  defined 
on  the  matching  results. 

In  the  following  experiment,  we  compare  the  effectiveness  of  different  similarity  mea¬ 
sures.  Unlike  the  previous  experiments,  we  make  full  use  of  the  database.  All  samples 
are  used  in  turn  for  query.  Since  our  shape  matching  algorithm  is  much  slower  than  the 
shape  context  matching  algorithm,  we  use  the  original  matching  method.  Table  11  shows 
the  experimental  results.  The  most  powerful  similarity  measure  is  the  TPS  bending  en- 
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Table  11:  Query  with  different  similarity  measures. 


Vsc 

Vbe 

Vaf 

T>re 

Vou 

Vgc+Vbe 

Vall 

Overall  R-Precision 

56.1% 

60.9% 

57.3% 

50.1% 

35.5% 

71.2% 

74.4% 

Table  12:  Query  with  multiple  handwritten  initial  samples  from  the  same  person. 


Number  of  instances 

One 

Two 

Three 

Four 

Overall  R-Precision 

74.4% 

83.4% 

86.2% 

88.1% 

ergy  (Z>&e),  followed  by  the  affine  transformation  based  measure  ( T>af ).  This  experiment 
show  that  measures  based  on  transformations  (TPS  for  non-linear  transformation  and 
affine  for  linear  transformation)  are  more  robust  than  other  features.  The  remaining  mea¬ 
sures  can  be  ordered  according  to  the  discrimination  power  as  the  shape  context  distance 
(T>sc),  the  residual  error  in  registration  ( Vre ),  and  the  outlier  ratio  (Vou).  The  outlier 
ratio  feature  is  significantly  less  effective  than  the  others,  although  it  still  provides  some 
retrieval  power.  By  combining  T>be  and  T>sc  as  in  the  original  shape  context  method,  the 
overall  R-Precision  is  71.2%.  If  we  combine  all  features,  the  overall  R-Precision  improves 
to  74.4%,  demonstrating  that  different  features  are  complementary.  Fig.  43  shows  one 
query  example.  Excluding  the  query  itself  from  the  database  leaves  39  relevant  samples. 
Among  them,  nine  are  ranked  outside  the  top  39  returned  samples.  The  figure  also  shows 
the  missed  samples  and  false  alarms.  Most  false  alarms  rank  low,  just  higher  than  the 
border  line. 

Table  12  shows  the  retrieval  accuracy  using  multiple  instances  from  the  same  person. 
We  randomly  select  multiple  instances  from  a  person  and  remove  them  from  the  database. 
Each  instance  is  used  to  query  the  database.  Eq.  138  is  used  to  combine  the  overall 
distance  measures  Vau  of  multiple  queries.  As  we  can  see,  using  multiple  instances 
can  significantly  increase  the  performance.  With  more  instances  added,  the  overall  R- 
Precision  increases  steadily.  When  four  instances  are  used,  88.1%  overall  R-Precision  is 
achieved. 

7.3.3  Logo  Retrieval 

A  logo  is  important  to  identify  a  document’s  source.  Generally,  logos  can  be  seen  as  rigid 
shapes.  However,  some  organizations  make  small  adjustment  to  their  logos  periodically. 
Logos  from  different  departments  under  the  same  organization  (such  as  a  university  or  the 
federal  government)  may  contain  a  similar  layout  with  a  small  variation  to  reflect  their 
identity.  In  this  experiment,  we  consider  the  task  of  finding  all  documents  containing  a 
specified  logo  given  by  a  query.  Since  a  logo  is  often  embedded  in  a  document,  we  need 
to  segment  it  from  other  document  components.  We  use  the  automatic  logo  detection 
algorithm  developed  by  Dr.  Drayer  at  Fort  Meade,  Maryland.  The  tobacco  data  set  is 
used  for  the  experiment.  In  the  data  set,  546  documents  contain  visible  logos,  and  the 
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Query  Instance 


Figure  43:  Handwritten  initial  retrieval.  The  first  row  is  the  instance  used  for  query. 
The  middle  zone  shows  the  missed  samples,  and  the  bottom  zone  shows  the  false  alarms. 
The  rank  for  each  sample  is  shown  under  the  corresponding  image. 
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Figure  44:  Logo  retrieval.  The  top  row  is  the  instance  used  for  query.  The  following  rows 
show  the  missed  logos  in  the  top  returned  samples. 


algorithm  detected  350  of  them.  A  logo  is  claimed  to  be  detected  if  the  major  parts  of  the 
logo  are  detected,  so  in  many  cases  the  so-called  correct  detection  is  not  perfect.  A  small 
part  of  a  logo  may  be  lost,  and  other  components  may  be  grouped  into  the  detected  logo. 
We  use  these  350  detected  logos  for  the  experiment  of  logo  retrieval.  There  are  23  logo 
categories,  and  some  may  contain  sub-categories  (variations).  The  distribution  of  each 
category  is  not  even.  About  half  of  the  logos,  179  instances,  belong  to  one  category.  We 
randomly  selected  one  instance  from  this  category  as  a  query,  as  shown  in  the  first  row  of 
Fig.  44.  Since  skeleton  representation  is  sensitive  to  noise  for  a  general  shape,  we  use  200 
points  uniformly  sampled  from  the  contour  to  represent  a  logo.  All  similarity  measures 
are  used,  and  the  overall  distance  measure  T>au  is  used  to  rank  the  remaining  349  logos. 
Thirty-three  of  the  relevant  logos  rank  outside  the  top  178,  resulting  in  the  corresponding 
R-Precision  of  81.5%.  Fig.  44  shows  several  relevant  logos  that  rank  outside  the  top  178. 
Most  missed  logos  occur  because  of  the  segmentation  errors.  In  some  cases,  the  approach 
may  fail  due  to  variations  in  the  logo  design,  as  shown  in  the  last  row  of  Fig.  44. 

To  fully  evaluate  the  performance  in  logo  retrieval,  we  use  each  instance  in  the  col¬ 
lection  for  query.  Table  7.3.3  shows  the  overall  R-Precision  using  different  similarity 
measures.  The  overall  R-Precision  is  67.9%.  As  demonstrated  in  the  table,  combining 
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Table  13:  Overall  R-Precision  for  logo  retrieval  with  different  similarity  measures. 


Dsc 

T^be 

Vaf 

V>re 

Dou 

'Dali 

Overall  R-Precision 

53.8% 

57.8% 

54.6% 

48.3% 

30.3% 

67.9% 

different  similarity  measures  can  significantly  improve  the  retrieval  accuracy. 

7.4  Summary  and  Future  Work 

Experiments  show  that  shape  context  is  effective  for  handwriting  retrieval.  Skeleton 
representation  is  more  suitable  for  handwriting  than  contour  and  improves  the  retrieval 
performance.  Our  new  shape  matching  method  only  slightly  improves  the  retrieval  ac¬ 
curacy,  since  it  simultaneously  decreases  the  distance  measures  between  two  shapes  from 
both  the  same  and  different  categories.  Overall,  the  distinguishing  power  of  a  distance 
measure  only  improves  slightly  using  our  matching  method.  Adding  more  similarity  mea¬ 
sures  will  improve  the  retrieval  accuracy  significantly.  A  more  effective  way  to  improve 
the  accuracy  is  to  use  multiple  samples  for  query.  When  four  instances  are  used  for  query, 
the  overall  R-Precision  increases  from  74.4%  to  88.1%  in  our  handwritten  initial  retrieval 
experiment. 

Handwriting  retrieval  (non-rigid  shape  retrieval  in  general)  is  a  difficult  problem.  In 
this  chapter,  we  presented  only  our  preliminary  efforts  on  this  topic.  Much  work  remains 
for  the  future.  We  tried  to  combine  our  previous  work  on  handwriting  identification  with 
the  handwriting  retrieval  proposed  in  this  chapter  into  a  complete  system.  However, 
segmentation  errors  significantly  deteriorated  the  overall  retrieval  performance.  This  is 
still  an  open  problem  in  computer  vision  and  needs  further  investigation.  Future  research 
may  also  look  into  the  optimal  combination  scheme  for  different  distance  measures.  Rel¬ 
evance  feedback  is  a  solution,  but  the  effectiveness  of  this  approach  needs  to  be  verified 
in  the  context  of  handwriting  retrieval  by  experiments. 
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8  Conclusions 


8.1  Summary 

Our  work  has  centered  around  the  ability  to  separate  handwriting  from  other  layers  in 
noisy  documents.  Following  are  our  key  contributions. 

1.  Many  handwritten  documents  have  rule  lines  as  a  background  pattern.  The  lines 
must  be  detected  and  removed  before  feeding  the  text  to  an  optical  character  recog¬ 
nition  (OCR)  engine.  Severely  broken  rule  lines  present  a  great  challenge  for  ex¬ 
isting  line  detection  algorithms.  We  proposed  an  HMM  model  to  incorporate  the 
constraints  among  a  set  of  parallel  lines.  Our  method  is  fast  and  achieves  both 
a  high  accuracy  and  a  low  false  alarm  rate.  It  has  been  tested  on  a  real  Arabic 
data  set,  and  promising  results  were  achieved.  After  line  detection,  line  removal  is 
performed  by  line  width  thresholding. 

2.  For  handwriting  identification  in  noisy  documents,  we  proposed  a  classification 
based  scheme.  The  input  document  is  segmented  at  the  word  level.  Several  fea¬ 
tures,  including  structures,  Gabor  filters,  run-length  histograms,  crossing-count  his¬ 
tograms,  and  textures,  are  extracted  for  classification.  The  classification  result  is 
reasonable,  with  a  few  mis-classifications  due  to  the  overlapping  of  different  classes 
in  the  feature  space.  Some  other  cues  may  refine  the  classification  results.  For 
example,  machine  printed  text,  handwriting,  and  noise  exhibit  different  patterns  of 
geometric  relationships.  Printed  words  often  form  horizontal  (or  vertical)  text  lines, 
and  noise  blocks  tend  to  overlap  each  other.  The  novelty  of  our  approach  involves 
using  the  Markov  random  field  (MRF)  to  model  the  geometric  relationship  among 
neighboring  blocks.  Experiments  show  that  MRF  based  post-processing  is  effective, 
where  almost  half  of  the  mis-classifications  are  corrected  after  post-processing. 

3.  The  identified  handwriting  may  be  further  analyzed.  In  this  dissertation,  we  pro¬ 
posed  a  novel  point  pattern  based  handwriting  matching  technique  and  applied 
it  for  handwriting  synthesis  and  retrieval.  We  studied  handwriting  matching  in  a 
broader  context  of  nonrigid  shape  matching.  For  nonrigid  shapes,  most  neighboring 
points  cannot  move  independently  under  deformation  due  to  physical  constraints. 
Therefore,  though  the  absolute  distance  between  two  points  may  change  signifi¬ 
cantly,  the  neighborhood  of  a  point  is  well  preserved  in  general.  Based  on  this 
observation,  we  formulate  point  matching  as  a  graph  matching  problem.  Each 
point  is  a  node  in  the  graph,  and  two  nodes  are  connected  by  an  edge  if  their  Eu¬ 
clidean  distance  is  less  than  a  threshold.  The  optimal  match  between  two  graphs  is 
the  one  that  maximizes  the  number  of  matched  edges.  The  shape  context  distance 
is  used  to  initialize  the  graph  matching,  followed  by  relaxation  labeling  for  refine¬ 
ment.  Experiments  demonstrate  the  effectiveness  of  our  approach:  it  outperforms 
the  shape  context  and  TPS-RPM  algorithms  under  nonrigid  deformation  and  noise 
on  a  public  data  set. 

4.  The  techniques  proposed  in  this  paper  are  not  limited  to  the  processing  of  handwrit¬ 
ing  documents.  For  example,  our  model-based  line  detection  algorithm,  proposed  in 
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Chapter  2,  can  extend  straightforwardly  for  known  form  identification  and  registra¬ 
tion.  In  Chapter  4,  we  separate  different  components  into  three  layers,  handwriting, 
machine  printed  text,  and  noise.  By  separating  noise,  the  layer  of  machine  printed 
text  is  much  cleaner  than  the  original  noisy  document,  which  facilitates  further 
processing,  such  as  zone  segmentation  and  OCR.  Our  approach  for  handwriting 
matching,  discussed  in  Chapter  5,  is  general  enough  to  be  applied  to  other  point 
pattern  based  nonrigid  shape  matching  applications. 

8.2  Future  Work 

Though  promising  results  have  been  achieved  on  several  key  issues  in  processing  of  hand¬ 
writing  documents,  many  possible  extensions  for  further  improvement  exist. 

1.  In  Chapter  4,  how  to  extend  our  handwriting  identification  method  to  cursive 
scripts,  such  as  Arabic,  is  under  investigation.  We  found  two  observations  used  to 
discriminate  handwriting  from  machine  printed  text  for  English  documents  do  not 
work  well  for  Arabic  documents.  (1)  Handwriting  is  more  cursive  than  machine 
printed  text  in  English  documents.  However,  machine  printed  Arabic  text  is  cur¬ 
sive  by  nature.  (2)  People  like  to  connect  several  neighboring  characters  during 
writing.  However,  machine  printed  Arabic  characters  are  often  connected  too.  Pre¬ 
liminary  experiments  on  Arabic  documents,  using  the  same  feature  set  proposed 
in  Chapter  4,  resulted  in  a  poor  classification  accuracy.  One  possible  solution  is  to 
design  new  features  for  Arabic  documents.  Alternatively,  we  can  perform  handwrit¬ 
ing/machine  printed  text  discrimination  at  a  higher  level  than  word  blocks,  such 
as  the  text  line  level.  Reliably  extracting  text  lines  from  a  heterogeneous  and  noisy 
document  is  a  challenging  problem  itself.  I  am  collaborating  with  another  PhD 
student  on  this  topic.  Preliminary  results  are  promising,  but  more  experiments 
are  necessary.  Another  problem  is  that  word  level  classification  is  still  demanded 
in  real  applications  because  short  text  lines  may  contain  only  one  or  two  words. 
How  to  combine  the  classification  results  at  word  and  text  line  levels  presents  one 
direction  for  future  research. 

2.  For  nonrigid  shape  matching  in  Chapter  5,  large  outlier  or  occlusion  ratios  (espe¬ 
cially  if  the  occlusion  breaks  a  shape  into  several  isolated  parts)  can  significantly 
change  the  local  neighborhood  structures.  Combination  of  different  sources  of 
degradation,  such  as  large  rotation,  noise,  and  occlusion,  also  presents  a  challenge, 
which  should  be  addressed  in  future  research.  In  this  dissertation,  the  relaxation 
labeling  method  is  used  to  solve  the  constrained  optimization  problem.  Converging 
only  to  a  local  optimum,  it  is  by  no  means  the  best  approach. 

3.  In  Chapter  7,  our  experiments  on  handwriting  retrieval  show  that  a  better  shape 
matching  method  does  not  always  result  in  a  higher  retrieval  accuracy.  Combin¬ 
ing  several  robust  similarity  measures  is  more  effective  to  improve  the  retrieval 
accuracy.  How  to  get  the  optimal  combination  weights  for  different  measures  is  a 
topic  under  investigation.  Techniques  in  the  literature  of  traditional  information 
retrieval,  such  as  relevance  feedback  or  clustering  analysis,  should  be  studied  in 
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the  context  of  handwriting  retrieval  to  test  their  effectiveness.  Segmentation  errors 
will  significantly  deteriorate  the  overall  retrieval  performance.  This  is  still  an  open 
problem  in  computer  vision  and  needs  further  investigation. 
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